Abstract: Important inference problems in statistical physics,
computer vision, error-correcting coding theory, and artificial in- 
telligence can all be reformulated as the computation of marginal 
probabilities on factor graphs. The belief propagation (BP) algo- 
rithm is an efficient way to solve these problems that is exact when 
the factor graph is a tree, but only approximate when the factor 
graph has cycles. 

We show that BP fixed points correspond to the stationary 
points of the Bethe approximation to the free energy for a factor 
graph. We explain how to obtain region-based free energy approx- 
imations that improve the Bethe approximation, and correspond- 
ing generalized belief propagation (GBP) algorithms. 

We emphasize the conditions a free energy approximation must 
satisfy in order to be a "valid" approximation. We describe the 
relationship between four different methods that can be used to 
generate valid approximations: the "Bethe method," the "junction 
graph method," the "cluster variation method," and the "region 
graph method." 

The region graph method is the most general of these methods, 
and it subsumes all the other methods. Region graphs also provide 
the natural graphical setting for GBP algorithms. We explain how 
to obtain three different versions of GBP algorithms and show that 
their fixed points will always correspond to stationary points of the 
region graph approximation to the free energy. We also show that 
the region graph approximation is exact when the region graph 
has no cycles. 



I. Introduction 

Problems involving probabilistic inference using graphical 
models are important in a wide variety of disciplines, includ- 
ing statistical physics, signal processing, artificial intelligence, 
and digital communications [1], [2]. Message-passing algo- 
rithms are a practical and powerful way to solve such problems. 
The centrality of such problems and the utility of message- 
passing algorithms for solving them is an explanation for the 
fact that equivalent or very closely-related message-passing al- 
gorithms have now been independently invented many times. 
They are well-known by names like the forward-backward al- 
gorithm for Hidden Markov Models [3], the Viterbi algorithm 
[4], [5], Gallager's sum-product algorithm for decoding low- 
density parity check codes [6], the "turbo-decoding" algorithm 
[7], [8], Pearl's "belief propagation" algorithm for inference on 
Bayesian networks [9], the "Kalman filter" for signal process- 
ing [10], [11], and the "transfer matrix" approach in statistical 
mechanics [12]. 

† MERL Cambridge Research Lab, 201 Broadway, 8th Floor, Cambridge
MA 02139. yedidia@merl.com

‡ Electrical Engineering and Computer Science, MIT Artificial Intelligence
Laboratory, NE43a, Cambridge MA 02139. wtf@ai.mit.edu

§ School of Computer Science and Engineering, The Hebrew University of
Jerusalem, 91904 Jerusalem, Israel. yweiss@cs.huji.ac.il



In this list of "standard" belief propagation (BP) algorithms, 
we have blurred a distinction between two different objectives 
that one might have, and the slightly different algorithms that 
result. Sometimes, one might be interested in obtaining the one 
global state of a system that is most probable or otherwise op- 
timal. In other cases, one is interested in obtaining marginal 
probabilities for some subset of the nodes of the system, given 
evidence about other nodes in the system. In this paper, we will 
focus exclusively on this latter problem. 

In all standard BP algorithms, messages are sent from one 
node in a graphical model to a neighboring node. The algo- 
rithms are exact when the graphical model is free of cycles. 
Thus, a common approach for dealing with graphical models 
that do have cycles is to try to convert them to equivalent cycle- 
free graphical models, and then to use the standard BP algo- 
rithm [13]. In some cases, this is possible, but for many other 
cases of practical interest, such an approach is intractable, and 
one must settle for approximate methods. 

Fortunately, the standard BP algorithms are well-defined, and 
often give surprisingly good approximate results, for graphical 
models with cycles. Nevertheless, in such cases there are no 
guarantees, and sometimes the results are quite poor, or the algorithm
fails to give any result at all because it does not converge [14].
Two major goals of this paper are to explain why the
standard BP algorithm often works so well even for graphical 
models with cycles, and to use that understanding to develop 
improved algorithms for cases when it does not work well. 

The class of algorithms that we will describe, which we call 
generalized belief propagation (GBP) algorithms, all have the 
characteristic that sets or regions of nodes will send messages to 
other regions of nodes. The regions of nodes that communicate 
with each other can be easily visualized in terms of a region 
graph. The standard BP algorithm is a special case of a GBP 
algorithm, with a particular choice of regions. Different choices 
of region graphs will give different GBP algorithms, and the 
user can choose to trade off complexity for accuracy. 

In practice, GBP algorithms can often dramatically outper- 
form BP algorithms in terms of either their accuracy or their 
convergence properties, for minimal computational cost, if one 
makes an intelligent choice of regions. However, how to opti- 
mally choose regions for a GBP algorithm remains at this point 
more an art than a science. We hope that this paper contributes 
to this problem by delineating what classes of constructions are 
likely to give good results. 

We shall give a theoretical justification of GBP algorithms 
by showing that their fixed points are identical to the stationary 
points of a region-based free energy, which is an approximation 
to another free energy that can be justified by a rigorous variational
principle. The first specialized examples of such free
energies were introduced long ago in the physics literature by
Bethe [15] and Kikuchi [16]. For the important special case
of the standard BP algorithm, we show that its fixed points are
the same as the stationary points of the Bethe free energy, thus
establishing an important basic link between a classical algorithm
and a classical approximation from physics.

One must be careful in constructing a region graph in or- 
der to ensure that the resulting approximations are accurate. 
In our original work introducing GBP algorithms [17], we fo- 
cused on a sub-class of GBP algorithms that were equivalent 
to free energy approximations based on Kikuchi's cluster vari- 
ation method [16], [18], [19]. We shall show that this method 
is only one of a variety of methods to generate region graphs 
and their corresponding free energies and message-passing al- 
gorithms. 

In our original work, we also focused on graphical models 
defined in terms of pair-wise or higher-order Markov random 
fields (MRFs). In this paper, we shall instead focus on graphi- 
cal models defined in terms of factor graphs. All our results can 
be re-expressed for other graphical models without difficulty. 
Using factor graphs has certain practical advantages; in particular,
we can refer the neophyte reader to the excellent review by
Kschischang et al. [20]. That review explains the equivalence
to factor graphs of other graphical models such as Bayesian net- 
works, Tanner graphs for error-correcting codes, or pair-wise 
MRFs, and explains the standard BP algorithm in its various 
guises as an algorithm that operates on factor graphs. 

Other formulations of the standard BP algorithm provide dif- 
ferent insights, and we refer the interested reader to a number 
of important recent papers that exploit alternative views of the 
BP algorithm [21], [22], [23], [24], [25], [26]. 

After our original work which introduced region-based free 
energies and GBP algorithms based on the cluster variation 
method, Aji and McEliece introduced a class of free energy 
approximations and GBP algorithms based on junction graphs 
[27]. One of the goals of this paper is to unify our previous ap- 
proach with the one that Aji and McEliece presented. McEliece 
and Yildirim have independently developed a unified approach 
to belief propagation which is largely equivalent to our region 
graph approach, and we recommend their elegant exposition 
[28]. Pakzad and Anantharam have also recently presented par- 
allel ideas in a brief paper [29]. 

The outline for the rest of the paper is as follows. In section 
II, we review and introduce our notation for factor graphs and 
the standard BP algorithm. In sections III and IV, we introduce
and explain the physical intuition behind variational free ener- 
gies and region-based approximations to them. In section V, we 
consider the Bethe Method which can be used to obtain particu- 
larly simple region-based free energy approximations. We also 
show in that section that the standard BP algorithm has fixed 
points equivalent to the stationary points of the Bethe approx- 
imation to the free energy. In section VI, we develop a theory 
that can be used to determine which region-based free energy 
approximations will be likely to give accurate results. In par- 
ticular, we describe the Region Graph Method, a very general 
method for generating valid region graphs and their associated 
free energies. In section VII, we introduce GBP algorithms, and
show that there are actually a variety of ways to define GBP al- 
gorithms for any given region graph, all of which have identical 
fixed points. We focus on one particular type of GBP algorithm, 
which we call the parent-to-child algorithm. Finally, in section 
VIII, we give a detailed example of the implementation of the
parent-to-child GBP algorithm. 

We have chosen to put an unusually large amount of mate- 
rial in the appendices of this paper. We did this in an attempt 
to help the reader grasp the fundamental concepts behind our 
work and not lose sight of the forest because of all the trees. 
The appendices describe a variety of other methods to generate 
region graphs and GBP algorithms which could easily prove to 
be as important in practice as the methods described in the main 
text. 

II. Factor Graphs and Belief Propagation 

Let $\{X_1, X_2, \ldots, X_N\}$ be a set of $N$ discrete-valued random
variables and let $x_i$ represent the possible realizations
of random variable $X_i$. We consider the joint probability
mass function $p(X_1 = x_1, X_2 = x_2, \ldots, X_N = x_N)$, which
we shall write more succinctly as $p(\mathbf{x})$, where $\mathbf{x}$ stands for
$\{x_1, x_2, \ldots, x_N\}$. We suppose that $p(\mathbf{x})$ factors into a product
of functions. That is, we suppose that $p(\mathbf{x})$ has the very general
form:

$$p(\mathbf{x}) = \frac{1}{Z} \prod_{a} f_a(\mathbf{x}_a). \tag{1}$$

Here $a$ is an index labeling $M$ functions $f_A, f_B, f_C, \ldots, f_M$,
where the function $f_a(\mathbf{x}_a)$ has arguments $\mathbf{x}_a$ that are some subset
of $\{x_1, x_2, \ldots, x_N\}$. $Z$ is a normalization constant.

A factor graph [20] is a bipartite graph that expresses the 
factorization structure in equation (1). A factor graph has a 
variable node (which we draw as a circle) for each variable $x_i$,
a factor node (which we draw as a square) for each function $f_a$,
with an edge connecting variable node $i$ to factor node $a$ if and
only if $x_i$ is an argument of $f_a$. (We shall always index variable
nodes with letters starting with $i$, and factor nodes with letters
starting with $a$.) As an example, the factor graph corresponding
to

$$p(x_1, x_2, x_3, x_4) = \frac{1}{Z} f_A(x_1, x_2) f_B(x_2, x_3, x_4) f_C(x_4) \tag{2}$$

is shown in figure 1.




Fig. 1. A small factor graph representing the joint probability distribution
$p(x_1, x_2, x_3, x_4) = \frac{1}{Z} f_A(x_1, x_2) f_B(x_2, x_3, x_4) f_C(x_4)$.

We shall focus on the problem of computing marginal probability
distributions. We call the possible values of $x_i$ the states
of variable node $i$. If $S$ is a set of variable nodes, we use
$\mathbf{x}_S$ to denote the states of the corresponding variable nodes.
$p_S(\mathbf{x}_S)$ will denote the marginal probability function obtained
by marginalizing $p(\mathbf{x})$ onto the set of variable nodes $S$, i.e.,

$$p_S(\mathbf{x}_S) = \sum_{\mathbf{x} \setminus \mathbf{x}_S} p(\mathbf{x}). \tag{3}$$

Here the sum over $\mathbf{x} \setminus \mathbf{x}_S$ indicates that we sum over the states of
all the variable nodes not in the set $S$. We shall write $p_i(x_i)$ for
the marginal probability function when the set S consists of the 
single node i. One should note that the problem of computing 
marginal probability functions is in general hard because it can 
require summing an exponentially large number of terms. 
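
To make the cost of direct marginalization concrete, the following is a minimal Python sketch of equation (3) for the factor graph of equation (2). The binary state space and the numerical factor tables are illustrative assumptions, not values taken from this paper, and the helper names `unnormalized_p` and `marginal` are hypothetical. The loop visits every global state, so its cost grows exponentially with the number of variables.

```python
import itertools

# Hypothetical binary factor tables for the factor graph of equation (2);
# the numbers are illustrative assumptions, not values from the paper.
f_A = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}   # f_A(x1, x2)
f_B = {xs: 1.0 + xs[0] + 2 * xs[1] * xs[2]                   # f_B(x2, x3, x4)
       for xs in itertools.product((0, 1), repeat=3)}
f_C = {(0,): 0.3, (1,): 0.7}                                  # f_C(x4)

# Each entry: (table, tuple of 0-based variable indices the factor depends on).
factors = [(f_A, (0, 1)), (f_B, (1, 2, 3)), (f_C, (3,))]
n_vars, n_states = 4, 2


def unnormalized_p(x):
    """Product of all factors evaluated at the global state x."""
    prod = 1.0
    for table, scope in factors:
        prod *= table[tuple(x[i] for i in scope)]
    return prod


def marginal(subset):
    """Brute-force p_S(x_S): sum p(x) over all variables not in `subset`."""
    totals = {}
    for x in itertools.product(range(n_states), repeat=n_vars):
        key = tuple(x[i] for i in subset)
        totals[key] = totals.get(key, 0.0) + unnormalized_p(x)
    z = sum(totals.values())          # the normalization constant Z
    return {key: val / z for key, val in totals.items()}


p1 = marginal((0,))       # single-node marginal p_1(x_1)
p24 = marginal((1, 3))    # joint marginal over variable nodes 2 and 4
```

With four binary variables the sum has only 16 terms, but the same loop over $N$ variables with $q$ states each has $q^N$ terms.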

The belief propagation (BP) algorithm is a method for com- 
puting marginal probability functions. We describe the algo- 
rithm in terms of operations on a factor graph. As we already 
mentioned in the introduction, the computed marginal proba- 
bility functions will be exact if the factor graph has no cycles, 
but the BP algorithm is still well-defined and empirically often 
gives good approximate answers even when the factor graph 
does have cycles. 

To define the BP algorithm, we first introduce messages between
variable nodes and their neighboring factor nodes and
vice versa. The message $m_{a \to i}(x_i)$ from the factor node $a$ to
the variable node $i$ is a vector over the possible states of $x_i$. This
message can be interpreted as a statement from factor node $a$ to
variable node $i$ about the relative probabilities that $i$ is in its different
states, based on the function $f_a$. The message $n_{i \to a}(x_i)$
from the variable node $i$ to the factor node $a$ may in turn be
interpreted as a statement about the relative probabilities that
node $i$ is in its different states, based on all the information $i$
has except for that based on the function $f_a$.

The messages are initialized to $m_{a \to i}(x_i) = n_{i \to a}(x_i) = 1$
for all factor nodes $a$, variable nodes $i$, and states $x_i$. In fact,
other initializations are also possible, and the overall normal- 
ization of the messages can also be chosen arbitrarily. The only 
important normalization condition is on the beliefs, introduced 
below, which must sum to one in order to properly represent 
probabilities. The messages are updated according to the fol- 
lowing rules: 

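$$n_{i \to a}(x_i) := \prod_{b \in N(i) \setminus a} m_{b \to i}(x_i) \tag{4}$$

and

$$m_{a \to i}(x_i) := \sum_{\mathbf{x}_a \setminus x_i} f_a(\mathbf{x}_a) \prod_{j \in N(a) \setminus i} n_{j \to a}(x_j). \tag{5}$$

Here, $N(i) \setminus a$ denotes all the nodes that are neighbors of
node $i$ except for node $a$, and $\sum_{\mathbf{x}_a \setminus x_i}$ denotes a sum over all the
variables $\mathbf{x}_a$ that are arguments of $f_a$ except $x_i$. The messages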
may be normalized in any way that is convenient, as only the 
ratios of the terms in a message are relevant. This standard 
BP algorithm is sometimes called the "sum-product" algorithm 
because of the sum and product that occurs on the right-hand- 
side of equation (5). 

In some cases, it is convenient to eliminate the $n_{i \to a}(x_i)$
messages and write the message-update equations entirely in
terms of the $m_{a \to i}(x_i)$ messages. Alternatively, of course, one
could choose to eliminate the $m_{a \to i}(x_i)$ messages in favor of
the $n_{i \to a}(x_i)$ messages.

These message-update rules may initially appear quite
mysterious; a major goal of this paper will be to explain, justify,
and ultimately improve upon them. First though, to complete
our preliminary description of the standard BP algorithm,
we introduce the belief $b_i(x_i)$ at a variable node $i$, which is the
BP approximation to the exact marginal probability function
$p_i(x_i)$. The belief $b_i(x_i)$ can be computed from the equation

$$b_i(x_i) \propto \prod_{a \in N(i)} m_{a \to i}(x_i), \tag{6}$$

where we have used the proportionality symbol $\propto$ to indicate
that one must normalize the beliefs so that they sum to one.
The BP message-update equations are iterated until they (hope- 
fully) converge, at which point the beliefs can be read off from 
equation (6). 
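
The update rules (4) and (5) and the belief equation (6) can be sketched in a few lines of Python. The code below is one possible arrangement under stated assumptions: binary variables, illustrative factor values for the factor graph of equation (2), a fixed number of sweeps instead of a convergence test, and per-message normalization (which is allowed because only ratios matter). The names `run_bp` and `exact_marginals` are hypothetical.

```python
import itertools

# Illustrative binary factors for equation (2); the numbers are assumptions.
# Each entry maps a factor name to (0-based variable indices, function).
factors = {
    "A": ((0, 1), lambda x1, x2: 2.0 if x1 == x2 else 0.5),
    "B": ((1, 2, 3), lambda x2, x3, x4: 1.0 + x2 + 2 * x3 * x4),
    "C": ((3,), lambda x4: 0.3 if x4 == 0 else 0.7),
}
n_vars, states = 4, (0, 1)
neighbors = {i: [a for a, (scope, _) in factors.items() if i in scope]
             for i in range(n_vars)}


def normalize(vec):
    total = sum(vec)
    return [v / total for v in vec]


def run_bp(n_iters=10):
    # m[(a, i)] is the factor-to-variable message, n[(i, a)] the reverse.
    m = {(a, i): [1.0] * len(states)
         for a, (scope, _) in factors.items() for i in scope}
    n = {(i, a): [1.0] * len(states) for (a, i) in m}
    for _ in range(n_iters):
        # Equation (4): product of messages from the other neighboring factors.
        for (i, a) in n:
            msg = [1.0] * len(states)
            for b in neighbors[i]:
                if b != a:
                    msg = [u * v for u, v in zip(msg, m[(b, i)])]
            n[(i, a)] = normalize(msg)
        # Equation (5): sum over the other arguments of f_a.
        for (a, i) in m:
            scope, f = factors[a]
            msg = [0.0] * len(states)
            for xa in itertools.product(states, repeat=len(scope)):
                term = f(*xa)
                for j, xj in zip(scope, xa):
                    if j != i:
                        term *= n[(j, a)][xj]
                msg[xa[scope.index(i)]] += term
            m[(a, i)] = normalize(msg)
    # Equation (6): belief is the normalized product of incoming messages.
    beliefs = []
    for i in range(n_vars):
        b = [1.0] * len(states)
        for a in neighbors[i]:
            b = [u * v for u, v in zip(b, m[(a, i)])]
        beliefs.append(normalize(b))
    return beliefs


def exact_marginals():
    """Brute-force single-node marginals, for comparison only."""
    totals = [[0.0] * len(states) for _ in range(n_vars)]
    for x in itertools.product(states, repeat=n_vars):
        w = 1.0
        for scope, f in factors.values():
            w *= f(*(x[i] for i in scope))
        for i in range(n_vars):
            totals[i][x[i]] += w
    return [normalize(t) for t in totals]
```

Because this factor graph has no cycles, the beliefs returned by `run_bp()` agree with `exact_marginals()` up to floating-point error after a handful of sweeps.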

We can also use the BP algorithm to compute joint beliefs
$b_S(\mathbf{x}_S)$ over sets of variable nodes $S$ that may contain more
than one node. Consider the important case when the set $S$
consists of all the variable nodes attached to the $a$th function
$f_a(\mathbf{x}_a)$. We will denote the corresponding belief by $b_a(\mathbf{x}_a)$,
which will be given within the BP approximation by

$$b_a(\mathbf{x}_a) \propto f_a(\mathbf{x}_a) \prod_{i \in N(a)} n_{i \to a}(x_i) \propto f_a(\mathbf{x}_a) \prod_{i \in N(a)} \prod_{b \in N(i) \setminus a} m_{b \to i}(x_i). \tag{7}$$

We can directly derive the message update rules (4) and (5)
from the belief equations (6) and (7), along with the marginalization
condition

$$b_i(x_i) = \sum_{\mathbf{x}_a \setminus x_i} b_a(\mathbf{x}_a), \tag{8}$$

which holds when $x_i$ is one of the arguments in the set $\mathbf{x}_a$.
Thus, the belief equations (6) and (7) can be considered to de- 
fine the BP algorithm, a point of view that will prove useful 
later. 

The BP algorithm is normally justified as being an exact al- 
gorithm when the factor graph has no cycles (i.e., it has the 
topology of a tree). We shall not prove that property here, but
will simply give a small example: consider the joint probability
distribution given by equation (2) as illustrated in figure 1.
Suppose that we would like to compute $p_1(x_1)$, the marginal
probability distribution at variable node 1. Repeatedly using 
the BP equations, we find 

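$$\begin{aligned}
b_1(x_1) &\propto m_{A \to 1}(x_1) \\
&\propto \sum_{x_2} f_A(x_1, x_2)\, n_{2 \to A}(x_2) \\
&\propto \sum_{x_2} f_A(x_1, x_2)\, m_{B \to 2}(x_2) \\
&\propto \sum_{x_2} \sum_{x_3} \sum_{x_4} f_A(x_1, x_2) f_B(x_2, x_3, x_4)\, n_{3 \to B}(x_3)\, n_{4 \to B}(x_4) \\
&\propto \sum_{x_2} \sum_{x_3} \sum_{x_4} f_A(x_1, x_2) f_B(x_2, x_3, x_4) f_C(x_4)
\end{aligned} \tag{9}$$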

which is exactly the desired marginal probability function. We 
could similarly demonstrate that equation (7) would give ex- 
act multi-node marginal probabilities for graphs with no cycles. 
We can already see from this example that for graphs with no 
cycles, the BP algorithm is essentially a dynamic programming 
algorithm that organizes the computations necessary to com- 
pute marginal probability distributions in such a way that they 
become tractable. 
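
The dynamic-programming reading of this derivation can be checked with a short Python sketch. It assumes binary variables and illustrative factor values (not from this paper) and evaluates the marginal at node 1 twice: once by nesting the sums so that the inner sums are computed a single time and reused, and once as the flat triple sum of the last line of the derivation.

```python
import itertools

states = (0, 1)


# Illustrative factor values (assumptions, not taken from the paper).
def f_A(x1, x2):
    return 2.0 if x1 == x2 else 0.5


def f_B(x2, x3, x4):
    return 1.0 + x2 + 2 * x3 * x4


def f_C(x4):
    return 0.3 if x4 == 0 else 0.7


# Innermost sums first: this intermediate table plays the role of m_{B->2}(x2).
m_B_to_2 = {x2: sum(f_B(x2, x3, x4) * f_C(x4) for x3 in states for x4 in states)
            for x2 in states}

# The outer sum over x2 reuses that table and gives m_{A->1}(x1).
m_A_to_1 = {x1: sum(f_A(x1, x2) * m_B_to_2[x2] for x2 in states)
            for x1 in states}
z = sum(m_A_to_1.values())
b1 = {x1: v / z for x1, v in m_A_to_1.items()}

# Reference: the flat triple sum, which recomputes the inner sums for each x1.
flat = {x1: sum(f_A(x1, x2) * f_B(x2, x3, x4) * f_C(x4)
                for x2, x3, x4 in itertools.product(states, repeat=3))
        for x1 in states}
```

Both routes give the same normalized distribution; the nested version simply avoids repeating the sums over $x_3$ and $x_4$ for each value of $x_1$.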

The BP algorithm was introduced into the coding literature 
by Gallager as a sub-optimal probabilistic decoding algorithm 
for linear block error-correcting codes, and some readers may 
be most familiar with the BP algorithm in that context [6]. 
Other readers may be most familiar with the form of the BP 
algorithm introduced and popularized by Pearl [9] for proba- 
bilistic inference with Bayesian networks. Readers who are 
more familiar with the BP algorithm written in one of these
forms may want to consult the review by Kschischang et al.
[20], which explains the equivalence between these forms of 
the BP algorithm and the one we have chosen to use here. 

III. Free Energies 

In this section, we turn from simply describing the BP algo- 
rithm to explaining its success. In section II, we saw that the 
BP algorithm can be defined in terms of the belief equations (6) 
and (7). We shall eventually show that these belief equations 
correspond to the stationarity conditions for a functional of the 
beliefs called the Bethe free energy, $F_{\text{Bethe}}(b_i, b_a)$. This fact
serves in some sense to justify the BP algorithm even when the 
factor graph it operates on has cycles, because minimizing the 
Bethe free energy is a sensible approximation procedure that 
has a long and successful history in physics. It also points to a 
variety of ways to improve upon or generalize BP, especially by 
improving upon the approximations used in the Bethe free en- 
ergy. In the rest of the paper, we will discuss all of these issues, 
but we first turn to an explanation of the notion of a free energy. 

Suppose that one has a system of N particles, each of which 
can be in one of a discrete number of states, where the states 
of the $i$th particle are labeled by $x_i$. (As an example, one
might make a variety of simplifications and characterize the
states of the atoms in a magnetic crystal by whether a given
electron in each atom has an "up" spin or a "down" spin.)
The overall state of the system will be denoted by the vector
$\mathbf{x} = \{x_1, x_2, \ldots, x_N\}$. Each state of the system has a corresponding
energy $E(\mathbf{x})$. A fundamental result of statistical mechanics
is that, in thermal equilibrium, the probability of a state
will be given by Boltzmann's Law

$$p(\mathbf{x}) = \frac{1}{Z(T)} e^{-E(\mathbf{x})/T}. \tag{10}$$

Here, $T$ is the temperature, and $Z(T)$ is simply a normalization
constant, known as the partition function:

$$Z(T) = \sum_{\mathbf{x} \in S} e^{-E(\mathbf{x})/T}, \tag{11}$$

where $S$ is the space of all possible states $\mathbf{x}$ of the system.

A substantial part of statistical mechanics theory is devoted
to the justification of Boltzmann's Law. On the other hand, if
one begins with a joint probability distribution $p(\mathbf{x})$ for some
non-physical system, one can view Boltzmann's law as a postulate
that serves to define an energy for the system, where the
temperature can be set arbitrarily, as it simply sets a scale for
the units in which one measures energy. We shall take this
point of view and set $T = 1$ throughout the rest of this paper.
For the case of a factor graph probability distribution function
$p(\mathbf{x}) = (1/Z) \prod_{a=1}^{M} f_a(\mathbf{x}_a)$, we define the energy $E(\mathbf{x})$ of a
state $\mathbf{x}$ to be

$$E(\mathbf{x}) = -\sum_{a=1}^{M} \ln f_a(\mathbf{x}_a) \tag{12}$$

in order to be consistent with Boltzmann's Law.

The Helmholtz free energy $F_{\text{Helmholtz}}$ of a system is

$$F_{\text{Helmholtz}} = -\ln Z. \tag{13}$$

This free energy is a fundamentally important quantity in statistical
mechanics, because if one can calculate the functional dependence
of $F_{\text{Helmholtz}}$ on quantities like a macroscopic magnetic
field $H$ or temperature $T$, then it is easy to compute experimentally
measurable quantities like the response of the system
to a change in $H$ or $T$. Physicists have therefore devoted
considerable energy to developing techniques which give good
approximations to $F_{\text{Helmholtz}}$.

One important technique is based on a variational approach. 
Suppose again that p(x) is the true probability distribution of 
the system, which obeys Boltzmann's Law $p(\mathbf{x}) = e^{-E(\mathbf{x})}/Z$.
It may be that even if we know $p(\mathbf{x})$ exactly, it is of a form
that makes the computation of $F_{\text{Helmholtz}}$ difficult. We therefore
introduce a "trial" probability distribution $b(\mathbf{x})$, and a corresponding
variational free energy (often called the Gibbs free
energy) defined by

$$F(b) = U(b) - H(b), \tag{14}$$

where $U(b)$ is the variational average energy:

$$U(b) = \sum_{\mathbf{x} \in S} b(\mathbf{x}) E(\mathbf{x}) \tag{15}$$

and $H(b)$ is the variational entropy:

$$H(b) = -\sum_{\mathbf{x} \in S} b(\mathbf{x}) \ln b(\mathbf{x}). \tag{16}$$

It follows directly from our definitions that

$$F(b) = F_{\text{Helmholtz}} + D(b \| p), \tag{17}$$

where

$$D(b \| p) \equiv \sum_{\mathbf{x} \in S} b(\mathbf{x}) \ln \frac{b(\mathbf{x})}{p(\mathbf{x})} \tag{18}$$

is the Kullback-Leibler divergence between $b(\mathbf{x})$ and $p(\mathbf{x})$.
Since there exists a theorem [30] that $D(b \| p)$ is always non-negative
and is zero if and only if $b(\mathbf{x}) = p(\mathbf{x})$, we see that
$F(b) \geq F_{\text{Helmholtz}}$, with equality precisely when $b(\mathbf{x}) = p(\mathbf{x})$.
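
Equation (17) is easy to verify numerically on a system small enough to enumerate. The Python sketch below assumes four binary variables with illustrative factor values (the factor graph of equation (2), with numbers that are not from this paper) and a randomly drawn trial distribution; `gibbs_free_energy` and `kl` are hypothetical helper names.

```python
import itertools
import math
import random

states = (0, 1)


def energy(x):
    """E(x) = -sum_a ln f_a(x_a), equation (12), with assumed factor values."""
    x1, x2, x3, x4 = x
    f_a = 2.0 if x1 == x2 else 0.5
    f_b = 1.0 + x2 + 2 * x3 * x4
    f_c = 0.3 if x4 == 0 else 0.7
    return -math.log(f_a * f_b * f_c)


configs = list(itertools.product(states, repeat=4))
z = sum(math.exp(-energy(x)) for x in configs)            # equation (11), T = 1
p = {x: math.exp(-energy(x)) / z for x in configs}        # equation (10)
f_helmholtz = -math.log(z)                                # equation (13)


def gibbs_free_energy(b):
    u = sum(b[x] * energy(x) for x in configs)                      # (15)
    h = -sum(b[x] * math.log(b[x]) for x in configs if b[x] > 0)    # (16)
    return u - h                                                    # (14)


def kl(b, q):
    """Kullback-Leibler divergence D(b || q), equation (18)."""
    return sum(b[x] * math.log(b[x] / q[x]) for x in configs if b[x] > 0)


# An arbitrary strictly positive trial distribution b(x).
rng = random.Random(0)
weights = [rng.random() + 0.01 for _ in configs]
total = sum(weights)
b_trial = {x: w / total for x, w in zip(configs, weights)}
```

For any such trial distribution, `gibbs_free_energy(b_trial)` matches `f_helmholtz + kl(b_trial, p)` to floating-point accuracy, and the bound is tight when the trial distribution is `p` itself.
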
Minimizing the Gibbs free energy $F(b)$ is therefore an exact
procedure for computing $F_{\text{Helmholtz}}$ and recovering $p(\mathbf{x})$. Of
course, as $N$ becomes large, this procedure is also totally intractable,
as $b(\mathbf{x})$ will take exponentially large memory just to
store. A more practical possibility is to upper-bound $F_{\text{Helmholtz}}$
by minimizing $F(b)$ over a restricted class of probability distributions.
This is the basic idea underlying the mean field approach.
One very popular mean-field form for $b(\mathbf{x})$ is the factorized
form:

$$b_{\text{MF}}(\mathbf{x}) = \prod_{i=1}^{N} b_i(x_i). \tag{19}$$

Using this $b_{\text{MF}}(\mathbf{x})$, and an energy function $E(\mathbf{x})$ of the factor
graph form given in equation (12), we can easily compute the
mean field free energy $F_{\text{MF}} = U_{\text{MF}} - H_{\text{MF}}$ for an arbitrary
factor graph:

$$U_{\text{MF}}(\{b_1, \ldots, b_N\}) = -\sum_{a=1}^{M} \sum_{\mathbf{x}_a} \ln f_a(\mathbf{x}_a) \prod_{i \in N(a)} b_i(x_i), \tag{20}$$

$$H_{\text{MF}}(\{b_1, \ldots, b_N\}) = -\sum_{i=1}^{N} \sum_{x_i} b_i(x_i) \ln b_i(x_i). \tag{21}$$

Minimizing $F_{\text{MF}}(b_1, \ldots, b_N)$ over the $b_i$ will give us self-consistent
equations for the $b_i$, which can be solved numerically
to obtain a mean-field approximation for the beliefs bi(xi). 
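
One common way to solve such self-consistent equations numerically is to update one belief at a time while holding the others fixed. The Python sketch below does this under stated assumptions: the update $b_i(x_i) \propto \exp\big(\sum_{a \in N(i)} \sum_{\mathbf{x}_a \setminus x_i} \ln f_a(\mathbf{x}_a) \prod_{j \in N(a) \setminus i} b_j(x_j)\big)$ is the standard stationarity condition obtained from equations (20) and (21) with a normalization constraint (it is not written out in this paper), the factor values are illustrative, and the names `mean_field_free_energy` and `update_node` are hypothetical.

```python
import itertools
import math

states = (0, 1)
# Illustrative strictly positive factors for equation (2); values are assumed.
factors = [
    ((0, 1), lambda x1, x2: 2.0 if x1 == x2 else 0.5),
    ((1, 2, 3), lambda x2, x3, x4: 1.0 + x2 + 2 * x3 * x4),
    ((3,), lambda x4: 0.3 if x4 == 0 else 0.7),
]
n_vars = 4


def mean_field_free_energy(b):
    u = 0.0
    for scope, f in factors:
        for xa in itertools.product(states, repeat=len(scope)):
            weight = math.prod(b[i][xi] for i, xi in zip(scope, xa))
            u -= weight * math.log(f(*xa))                      # equation (20)
    h = -sum(bi[x] * math.log(bi[x])
             for bi in b for x in states if bi[x] > 0)          # equation (21)
    return u - h


def update_node(b, i):
    """Minimize F_MF over b_i with all other single-node beliefs held fixed."""
    log_b = [0.0 for _ in states]
    for scope, f in factors:
        if i not in scope:
            continue
        for xa in itertools.product(states, repeat=len(scope)):
            weight = math.prod(b[j][xj] for j, xj in zip(scope, xa) if j != i)
            log_b[xa[scope.index(i)]] += weight * math.log(f(*xa))
    unnorm = [math.exp(v) for v in log_b]
    total = sum(unnorm)
    b[i] = [v / total for v in unnorm]


b = [[0.5, 0.5] for _ in range(n_vars)]      # uniform starting beliefs
f_initial = mean_field_free_energy(b)
for _ in range(50):                          # fixed number of sweeps (assumed)
    for i in range(n_vars):
        update_node(b, i)
f_final = mean_field_free_energy(b)

# Exact Helmholtz free energy by enumeration, for comparison.
z = sum(math.prod(f(*(x[i] for i in scope)) for scope, f in factors)
        for x in itertools.product(states, repeat=n_vars))
f_helmholtz = -math.log(z)
```

Each single-node update cannot increase $F_{\text{MF}}$, and the final value stays above `f_helmholtz`, in line with the upper-bound property of the mean field approach.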

Instead of a factorized form, one might consider other more 
complicated forms for $b(\mathbf{x})$ which still lead to tractable approximations.
This is the idea behind the "structured mean-field"
approach [31]. We will not follow that path, and will instead 
describe a quite different approach to approximating F(b) in 
the next section; one which underlies the BP algorithm. 

IV. Region-based Free Energy Approximations 

Kikuchi and the other physicists who further developed the 
so-called cluster variation method [16], [18], [19] introduced 
a class of approximations to the Gibbs free energy F(b). The 
idea behind these approximations is similar to, but slightly different
from, the mean field approximation. Whereas the factorized
mean-field free energy $F_{\text{MF}}$ is a function of single-node beliefs
$b_i(x_i)$, in a Kikuchi approximation, the approximate free energy
$F_{\text{Kikuchi}}$ will be a function of beliefs $b_S(\mathbf{x}_S)$ over larger
sets $S$ of variable nodes. One drawback of the cluster variation
method is that in contrast with the mean-field approach,
one cannot normally explicitly construct an overall "trial" belief
vector $b(\mathbf{x})$ that is consistent with the multi-node beliefs
$b_S(\mathbf{x}_S)$, and therefore one does not normally obtain any upper
bound on $F$ [32]. On the other hand, one can make approximations
that are much more accurate than the factorized mean-field
approximation, and there is a great deal of flexibility in the
exact choice of approximation. As we shall also see in further
detail, these approximations can be exploited to yield message-passing
algorithms, and a particularly simple version, the Bethe
approximation, will give results that are equivalent to the standard
BP algorithm.

We shall actually describe here a class of approximations that 
generalize those generated by the cluster variation method as it
has been described in the physics literature, and will therefore 
refer to such approximations as region-based approximations. 
We refer to the sub-class of approximations specifically gener- 
ated using the cluster variation method as Kikuchi approxima- 
tions. 




Fig. 2. An illustration of the definition of a region. Regions are sets of variable 
and factor nodes in a factor graph such that all variable nodes connected to 
any included factor nodes are included. Thus, the sets of nodes {1,2} and 
{B, C, 2, 3, 4} could be regions, but {B, 3} could not be a region (since factor
node B was included, variable nodes 2 and 4 must also be included).

We begin by assuming that $p(\mathbf{x})$ has the factor graph form of
equation (1). We define a region $R$ of a factor graph to be a set
$V_R$ of variable nodes and a set $F_R$ of factor nodes, such that if a
factor node $a$ belongs to $F_R$, all the variable nodes neighboring
$a$ are in $V_R$. We give examples of sets of nodes that could or
could not be considered regions in figure 2. Note that the set
$F_R$ may be empty, and that a factor $a$ need not be included in
$F_R$ even if all its neighboring variable nodes are in $V_R$.

We define the state $\mathbf{x}_R$ of a region $R$ to be the collective set
of variable node states $\{x_i \mid i \in V_R\}$. The marginal probability
function over a region $R$ will be denoted by $p_R(\mathbf{x}_R)$, by which
we mean a marginalization of $p(\mathbf{x})$ onto the variable nodes in
$V_R$. The corresponding belief $b_R(\mathbf{x}_R)$ will be an approximation
to the true $p_R(\mathbf{x}_R)$.

We define the region energy $E_R(\mathbf{x}_R)$ to be

$$E_R(\mathbf{x}_R) = -\sum_{a \in F_R} \ln f_a(\mathbf{x}_a), \tag{22}$$

where, since all the variable nodes neighboring a factor node
$a \in F_R$ are guaranteed to be in the region $R$, we can always
determine any needed state $\mathbf{x}_a$ from the state $\mathbf{x}_R$. We further
define the region average energy $U_R(b_R)$, the region entropy
$H_R(b_R)$, and the region free energy $F_R(b_R)$, by

$$U_R(b_R) = \sum_{\mathbf{x}_R} b_R(\mathbf{x}_R) E_R(\mathbf{x}_R), \tag{23}$$

$$H_R(b_R) = -\sum_{\mathbf{x}_R} b_R(\mathbf{x}_R) \ln b_R(\mathbf{x}_R), \tag{24}$$

and

$$F_R(b_R) = U_R(b_R) - H_R(b_R). \tag{25}$$

The intuitive idea behind a region-based free energy approx- 
imation is that we will try to break up the factor graph into a set 
of large regions that include every factor and variable node, and 
say that the overall free energy is the sum of the free energies of 
all the regions. Of course, if some of the large regions overlap,
then we will have erred by counting the free energy contributed
by some nodes two or more times, so we then need to subtract 
out the free energies of these overlap regions in such a way that 
each factor and variable node is counted exactly once. 

To make these notions precise, we say that a region-based
approximation $F_{\mathcal{R}}$ for the Gibbs free energy will be defined in
terms of a set of regions $\mathcal{R}$, and an associated set of counting
numbers $c_R$. $c_R$ will always be an integer, but might be zero or
negative for some $R$.

We say that a set of regions $\mathcal{R}$ and counting numbers $c_R$ give
a valid region-based approximation when, for every factor node
$a$ and every variable node $i$ in the factor graph,

$$\sum_{R \in \mathcal{R}} c_R [a \in F_R] = \sum_{R \in \mathcal{R}} c_R [i \in V_R] = 1, \tag{26}$$

where $[z \in Z]$ is the set-membership indicator function equal
to 1 if $z \in Z$ and equal to 0 otherwise.

These conditions ensure that every factor and variable node 
will be counted exactly one time in the approximation to the 
free energy. If a given factor or variable node is added into the 
free energy in two different regions, then there must be another 
region where it is subtracted back out. 
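
The region definition and the validity condition (26) translate directly into code. The Python sketch below uses the small factor graph of figure 1; the representation of a region as a pair (set of factor names, set of variable indices), the helper names `is_region` and `is_valid`, and the particular choice of regions and counting numbers are assumptions made for illustration.

```python
# Factor graph of figure 1: each factor name maps to its neighboring variables.
factor_scopes = {"A": {1, 2}, "B": {2, 3, 4}, "C": {4}}
variables = {1, 2, 3, 4}


def is_region(factor_set, variable_set):
    """A region must contain every variable neighboring each included factor."""
    return all(factor_scopes[a] <= variable_set for a in factor_set)


def is_valid(regions, counting):
    """Condition (26): every factor and every variable is counted exactly once."""
    for a in factor_scopes:
        if sum(c for (fs, vs), c in zip(regions, counting) if a in fs) != 1:
            return False
    for i in variables:
        if sum(c for (fs, vs), c in zip(regions, counting) if i in vs) != 1:
            return False
    return True


# One large region per factor, plus the overlaps {2} and {4} subtracted once.
large = [({"A"}, {1, 2}), ({"B"}, {2, 3, 4}), ({"C"}, {4})]
small = [(set(), {2}), (set(), {4})]

valid_choice = is_valid(large + small, [1, 1, 1, -1, -1])
invalid_choice = is_valid(large, [1, 1, 1])   # variables 2 and 4 counted twice
```

The second choice fails because variable nodes 2 and 4 each appear in two large regions and nothing subtracts the overlap back out.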

Given a valid set of regions $\mathcal{R}$ and counting numbers $c_R$, the
region-based approximation to the Gibbs free energy is simply

$$F_{\mathcal{R}}(\{b_R\}) = \sum_{R \in \mathcal{R}} c_R F_R(b_R). \tag{27}$$

Note that the region-based average energy

$$U_{\mathcal{R}}(\{b_R\}) = \sum_{R \in \mathcal{R}} c_R U_R(b_R) = -\sum_{R \in \mathcal{R}} c_R \sum_{\mathbf{x}_R} b_R(\mathbf{x}_R) \sum_{a \in F_R} \ln f_a(\mathbf{x}_a) \tag{28}$$

will always be exact, provided that the beliefs $\{b_R(\mathbf{x}_R)\}$
are equal to the corresponding exact marginal probabilities
$\{p_R(\mathbf{x}_R)\}$. We can see this by comparing with the exact average
energy

$$U = \sum_{\mathbf{x} \in S} p(\mathbf{x}) E(\mathbf{x}) = -\sum_{a=1}^{M} \sum_{\mathbf{x}_a} p_a(\mathbf{x}_a) \ln f_a(\mathbf{x}_a) \tag{29}$$

and noting that the counting numbers $c_R$ guarantee that
each factor node is counted exactly once in equation (28), and
that if all the $b_R(\mathbf{x}_R)$ are exact in equation (28), they will properly
marginalize to give the necessary factors of $p_a(\mathbf{x}_a)$ in equation
(29).

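On the other hand, the region-based entropy

$$H_{\mathcal{R}}(\{b_R\}) = \sum_{R \in \mathcal{R}} c_R H_R(b_R) = -\sum_{R \in \mathcal{R}} c_R \sum_{\mathbf{x}_R} b_R(\mathbf{x}_R) \ln b_R(\mathbf{x}_R) \tag{30}$$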

will normally only be an approximation even if the beliefs 
bR(xR) were exactly equal to the true marginal probabilities, 
although the condition that each variable node is counted once
makes it a quite "reasonable" approximation, in the sense that 
if the probability distribution p(x) was flat, this entropy would 
at least count the number of degrees of freedom correctly. The 
region-based entropy will also be exact in certain cases that we 
describe later. 
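
As a numerical illustration of equations (22) through (30), the Python sketch below evaluates the region-based free energy for the cycle-free factor graph of figure 1, using exact marginals as the region beliefs. The factor values, the choice of one large region per factor with single-variable regions for the overlaps, and the helper names are assumptions made for this example.

```python
import itertools
import math

states = (0, 1)
# Illustrative factors for equation (2): name -> (variable labels, function).
factors = {
    "A": ((1, 2), lambda x1, x2: 2.0 if x1 == x2 else 0.5),
    "B": ((2, 3, 4), lambda x2, x3, x4: 1.0 + x2 + 2 * x3 * x4),
    "C": ((4,), lambda x4: 0.3 if x4 == 0 else 0.7),
}
variables = (1, 2, 3, 4)

# Each region: (factor names, variable labels, counting number c_R).
regions = [
    (("A",), (1, 2), 1), (("B",), (2, 3, 4), 1), (("C",), (4,), 1),
    ((), (1,), 0), ((), (2,), -1), ((), (3,), 0), ((), (4,), -1),
]


def weight(x):
    """Unnormalized probability of a global state x (dict: variable -> state)."""
    return math.prod(f(*(x[i] for i in scope)) for scope, f in factors.values())


configs = [dict(zip(variables, xs))
           for xs in itertools.product(states, repeat=len(variables))]
z = sum(weight(x) for x in configs)


def exact_marginal(var_tuple):
    out = {}
    for x in configs:
        key = tuple(x[i] for i in var_tuple)
        out[key] = out.get(key, 0.0) + weight(x) / z
    return out


def region_free_energy(factor_names, var_tuple, belief):
    u = h = 0.0
    for x_r, b in belief.items():
        if b == 0.0:
            continue
        state = dict(zip(var_tuple, x_r))
        e_r = -sum(math.log(factors[a][1](*(state[i] for i in factors[a][0])))
                   for a in factor_names)             # equation (22)
        u += b * e_r                                  # equation (23)
        h -= b * math.log(b)                          # equation (24)
    return u - h                                      # equation (25)


f_region = sum(c * region_free_energy(fs, vs, exact_marginal(vs))
               for fs, vs, c in regions)              # equation (27)
f_helmholtz = -math.log(z)
```

For this tree-structured example the two numbers coincide to floating-point accuracy; on a factor graph with cycles, the same computation would in general give only an approximation.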

How does one select a valid set of regions $\mathcal{R}$ and counting
numbers $c_R$ for a given factor graph? There are in fact an infinite
number of ways to do that. In the next section we will describe
a very straightforward approach which we call the Bethe
method, which is guaranteed to return valid sets of regions and 
counting numbers. We then prove that the fixed points of the 
standard BP algorithm correspond to stationary points of the 
Bethe approximation to the free energy. 

In the following section, we will introduce the region graph 
method, which is a very general approach to finding valid ap- 
proximations, based on constructing a region graph. Region 
graphs play a central role both in the description of the region
graph free energy and in the construction of corresponding
GBP algorithms, and provide a clear way of visualizing and
understanding a region-based approximation. 

The Bethe method is a special case of the much more gen- 
eral region graph method. In appendices A and B, we discuss 
two other important methods that are also special cases of the 
region graph method: the junction graph method and the clus- 
ter variation method. In appendix C, we discuss in detail the 
relationship between the different methods. 




Fig. 3. A factor graph which we use to illustrate a variety of region-based free 
energy approximations. 



V. The Bethe Method 

The origins of the Bethe method date back to 1935, 
and Bethe's famous approximation method for magnets [15]. 
Kikuchi, in his 1951 paper that pioneered the cluster variation 
method [16], recognized that Bethe's approximation was the 
simplest example of an approximation that could be generated 
using that method. Of course, from the modern point of view, 
these early papers focused on very special graphical models, 
and we warn the reader who wants to read the original papers 
that our description of Bethe's and Kikuchi's methods will bear 
little resemblance to their expositions. 

In the Bethe method, we take the set of regions included in $\mathcal{R}$
to be of two types. First, we have a set of large regions $\mathcal{R}_L$
such that the $M$ regions in $\mathcal{R}_L$ each contain exactly one
factor node and all the variable nodes neighboring that factor node.
Second, we have a set of small regions $\mathcal{R}_S$ such that the $N$
regions in $\mathcal{R}_S$ each contain a single variable node.

We take as an example the factor graph shown in figure 3,
which has six factor nodes which we label $A, B, C, D, E, F$
and nine variable nodes which we label $1, 2, \ldots, 9$. For
this example, we would have the following large regions
in $\mathcal{R}_L$: $\{A,1,2,4,5\}$, $\{B,2,3,5,6\}$, $\{C,4,5\}$, $\{D,5,6\}$,
$\{E,4,5,7,8\}$, and $\{F,5,6,8,9\}$, and the following small regions
in $\mathcal{R}_S$: $\{1\}$, $\{2\}$, $\{3\}$, $\{4\}$, $\{5\}$, $\{6\}$, $\{7\}$, $\{8\}$, and $\{9\}$.
The complete set of regions $\mathcal{R}_{\text{Bethe}}$ included in the Bethe
approximation is $\mathcal{R}_{\text{Bethe}} = \mathcal{R}_L \cup \mathcal{R}_S$.

If $R_1$ and $R_2$ are two regions, we say that $R_1$ is a sub-region
of $R_2$ and $R_2$ is a super-region of $R_1$ if the set of variable and
factor nodes in $R_1$ is a subset of those in $R_2$.

In the Bethe method, the counting numbers $c_R$ for each region
$R \in \mathcal{R}$ are given by

$$c_R = 1 - \sum_{S \in \mathcal{S}(R)} c_S \qquad (31)$$

where $\mathcal{S}(R)$ is the set of regions that are super-regions of $R$.

Using this definition we see that for every region $R \in \mathcal{R}_L$,
$c_R = 1$, while for every region $R \in \mathcal{R}_S$, $c_R = 1 - d_i$, where $d_i$
is the degree (number of neighboring factor nodes) of the variable
node $i$. It is easy to confirm that the Bethe approximation
will always be a valid approximation, as each factor and variable
node will clearly be counted once as required in equation
(26).
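The construction above can be checked mechanically. The following is a minimal sketch, assuming the factor-to-variable adjacency of figure 3 as read off the large regions listed in the text; the helper names (`bethe_counting_numbers`, `total_count`) are illustrative and not from the paper.

```python
# Factor graph of figure 3: each factor node maps to its neighboring variable nodes
# (adjacency inferred from the large regions listed in the text).
factors = {
    "A": {1, 2, 4, 5},
    "B": {2, 3, 5, 6},
    "C": {4, 5},
    "D": {5, 6},
    "E": {4, 5, 7, 8},
    "F": {5, 6, 8, 9},
}
variables = sorted(set().union(*factors.values()))

# Large regions: one factor node plus all of its neighboring variable nodes.
large = {name: frozenset({name}) | frozenset(nbrs) for name, nbrs in factors.items()}
# Small regions: a single variable node each.
small = {i: frozenset({i}) for i in variables}


def bethe_counting_numbers():
    """Counting numbers from equation (31): c_R = 1 - sum of c_S over super-regions S."""
    c = {region: 1 for region in large.values()}  # large regions have no super-regions
    for region in small.values():
        supers = [big for big in large.values() if region < big]
        c[region] = 1 - sum(c[big] for big in supers)
    return c


counting = bethe_counting_numbers()


def total_count(node):
    """Sum of counting numbers over all regions containing the given node."""
    return sum(c for region, c in counting.items() if node in region)


for node in list(factors) + variables:
    print(node, total_count(node))
```

Variable node 5 neighbors all six factor nodes, so its small region receives counting number $1 - 6 = -5$, and every factor and variable node ends up with a total count of one.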

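We can use our expressions for $c_R$ in equation (28) to obtain
the Bethe approximation to the Gibbs free energy
$F_{\text{Bethe}} = U_{\text{Bethe}} - H_{\text{Bethe}}$, where

$$U_{\text{Bethe}} = -\sum_{a=1}^{M} \sum_{x_a} b_a(x_a) \ln f_a(x_a) \qquad (32)$$

and

$$H_{\text{Bethe}} = -\sum_{a=1}^{M} \sum_{x_a} b_a(x_a) \ln b_a(x_a) + \sum_{i=1}^{N} (d_i - 1) \sum_{x_i} b_i(x_i) \ln b_i(x_i) \qquad (33)$$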

Note that the Bethe entropy will be exact if the factor graph has
no cycles, because in that case we have the exact formula [13]

$$p(x) = \prod_{a=1}^{M} p_a(x_a) \prod_{i=1}^{N} \left( p_i(x_i) \right)^{1 - d_i} \qquad (34)$$

which we can substitute into the formula for the variational
entropy to recover $H_{\text{Bethe}}$.

We shall now show that minimizing the Bethe approximation 
to the free energy will always give results that are equivalent to 
the standard BP algorithm, so the exactness of the Bethe ap- 
proximation for factor graphs with no cycles is no surprise. 
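The exactness claim for cycle-free graphs can be checked numerically. The sketch below assumes a hypothetical three-variable chain (two pairwise factors with random positive entries, not a graph from the paper), uses the exact marginals as beliefs, and compares the Bethe free energy of equations (32) and (33) against $-\ln Z$, which is the value of the Gibbs free energy at the true distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical chain x1 - f_A - x2 - f_B - x3 with binary variables.
f_A = rng.uniform(0.5, 2.0, size=(2, 2))  # f_A(x1, x2)
f_B = rng.uniform(0.5, 2.0, size=(2, 2))  # f_B(x2, x3)

# Unnormalized joint, partition function, and exact distribution p(x1, x2, x3).
joint = np.einsum("ij,jk->ijk", f_A, f_B)
Z = joint.sum()
p = joint / Z

# Exact marginals, used here as the beliefs.
b_A = p.sum(axis=2)        # b_A(x1, x2)
b_B = p.sum(axis=0)        # b_B(x2, x3)
b_1 = p.sum(axis=(1, 2))
b_2 = p.sum(axis=(0, 2))
b_3 = p.sum(axis=(0, 1))
degree = {1: 1, 2: 2, 3: 1}  # number of factor nodes neighboring each variable

# Equation (32): Bethe average energy.
U = -(b_A * np.log(f_A)).sum() - (b_B * np.log(f_B)).sum()

# Equation (33): Bethe entropy.
H = -(b_A * np.log(b_A)).sum() - (b_B * np.log(b_B)).sum()
for i, b in zip((1, 2, 3), (b_1, b_2, b_3)):
    H += (degree[i] - 1) * (b * np.log(b)).sum()

F_bethe = U - H
print(F_bethe, -np.log(Z))  # the two values agree up to rounding on a tree
```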



A. Equivalence of the Bethe Approximation and Standard BP 

We now show that the standard BP algorithm is equivalent to 
the Bethe approximation, and explore some of the implications 
of that equivalence. In particular, we show that the "messages" 
sent in BP are exponentiated combinations of Lagrange multi- 
pliers. 

Theorem: Let $\{m_{a \to i}(x_i), n_{i \to a}(x_i)\}$ be a set of BP
messages and let $\{b_a(x_a), b_i(x_i)\}$ be the beliefs calculated from
those messages. Then the beliefs are fixed points of the BP
algorithm if and only if they are zero gradient points of the Bethe
free energy $F_{\text{Bethe}}$, subject to the constraint that all the beliefs
are normalized and consistent.

Proof: We want to minimize the Bethe free energy, while
insisting that all the beliefs $b_i(x_i)$ and $b_a(x_a)$ are consistent. To
this end, we add Lagrange multipliers $\lambda_{ai}(x_i)$ which enforce the
constraint that $\sum_{x_a \setminus x_i} b_a(x_a) = b_i(x_i)$ for every factor node
$a$ and all its neighboring variable nodes $i$. We also need to add
Lagrange multipliers to normalize the beliefs, but we do not
clutter our equations with them, as their effects are automatically
taken into account if we simply normalize our beliefs.

Setting the derivative of the resulting Lagrangian $L_{\text{Bethe}}$ with
respect to the beliefs $b_i(x_i)$ and $b_a(x_a)$ equal to zero gives:

$$b_a(x_a) \propto f_a(x_a) \prod_{i \in N(a)} e^{\lambda_{ai}(x_i)} \qquad (35)$$

and

$$b_i(x_i) \propto \left( \prod_{a \in N(i)} e^{\lambda_{ai}(x_i)} \right)^{\frac{1}{d_i - 1}} \qquad (36)$$

If we make the identification

$$\lambda_{ai}(x_i) = \ln n_{i \to a}(x_i) = \ln \prod_{b \in N(i) \setminus a} m_{b \to i}(x_i) \qquad (37)$$

then we find that we recover the standard BP belief equations
(6) and (7), which means that the standard BP fixed points
correspond to stationary points of the constrained Bethe free
energy. □
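The last step of the proof can be spelled out for the variable-node beliefs. The short derivation below is a sketch under two assumptions: that equation (36) is read as the product of the factors $e^{\lambda_{ai}(x_i)}$ over $a \in N(i)$ raised to the power $1/(d_i - 1)$, and that $d_i > 1$. It uses the identification (37), in which each $\lambda_{ai}$ collects the messages into $i$ from all neighboring factor nodes except $a$.

```latex
\begin{align*}
b_i(x_i)
  &\propto \Bigl( \prod_{a \in N(i)} e^{\lambda_{ai}(x_i)} \Bigr)^{\frac{1}{d_i - 1}}
   = \Bigl( \prod_{a \in N(i)} \; \prod_{b \in N(i) \setminus a} m_{b \to i}(x_i) \Bigr)^{\frac{1}{d_i - 1}} \\
  &= \Bigl( \prod_{b \in N(i)} m_{b \to i}(x_i)^{\,d_i - 1} \Bigr)^{\frac{1}{d_i - 1}}
   = \prod_{b \in N(i)} m_{b \to i}(x_i).
\end{align*}
```

Each message $m_{b \to i}$ appears once for every neighboring factor node $a \neq b$, that is $d_i - 1$ times, which is why the exponent $1/(d_i - 1)$ cancels and the familiar product-of-incoming-messages form of the belief results.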

The fact that $L_{\text{Bethe}}$ is bounded below implies that the BP
equations always possess a fixed point (obtained at the global
minimum of $F_{\text{Bethe}}$). To our knowledge, this is the first proof
of the existence of BP fixed points for a general graph with
arbitrary potentials. Of course, the existence of a fixed point does
not imply that the BP algorithm will converge starting from
arbitrary initial conditions.

The conditions for the uniqueness of BP fixed points are also
clarified by the equivalence with the Bethe approximation. In
graphs with no more than a single cycle, it was known that if
all factors are strictly positive ($f_a(x_a) > 0$ for all $a$ and $x_a$),
then there is a unique BP fixed point [33]. For general graphs,
we can use the equivalence established above to answer a
question about the uniqueness of stationary points for the Bethe free
energy. The issue of the number of stationary points of
approximate free energies is well studied in physics. To be more
precise, we can imagine defining a sequence of probability
distributions where some or all of our original functions are
raised to a power: $f_a(x_a; T) = f_a(x_a)^{1/T}$. This is equivalent
to changing the temperature in a physical system, where
$T$ is the temperature. Many systems, for example Ising
ferromagnets, will have different numbers of solutions above or
below a critical temperature $T_c$ within the Bethe approximation
[34]. Above $T_c$, the constrained free energy is convex and
has a unique stationary point, while below $T_c$, there are multiple
stationary points. Using this equivalence it is easy to define
small factor graphs that show a similar behavior. Although the
topology does not change and the factors are always positive,
as we smoothly change the factors we go from a regime with a
unique fixed point to one with multiple fixed points.

While we have shown that standard BP can only converge
to stationary points of the constrained Bethe free energy, it
is important to realize that BP does not perform constrained
minimization of the Bethe free energy; i.e. it does not
decrease $F_{\text{Bethe}}$ at every iteration. Indeed, the marginalization
constraints are typically not satisfied at intermediate iterations
of BP: it is only at a fixed point that the beliefs are in a
feasible set. Based on the equivalence, first noted in our earlier
work [17], others have recently devised algorithms that directly
minimize the free energy on the feasible set [35], [36], [37].
Such free energy minimizations are somewhat slower than the
BP algorithm, but they are guaranteed to converge.



VI. The Region Graph Method

We now introduce region graphs, which are central to the
region graph method for generating valid free energy
approximations, and also will provide a graphical framework for GBP
algorithms.

Let $I$ be the set of indices for the factor and variable nodes
in a factor graph. A region graph is a labeled, directed graph
$G = (V, E, L)$ in which each vertex $v \in V$ (corresponding to
a region) is labeled with a subset of $I$. We denote the label of
vertex $v$ by $L(v)$. A directed edge (or arc) may exist pointing
from vertex $v_p$ to vertex $v_c$ if $L(v_c)$ is a subset of $L(v_p)$. If such
an arc exists, we say that $v_c$ is a child of $v_p$, that $v_p$ is a parent of
$v_c$, and that they belong to different generations. If there exists
a directed path from vertex $v_a$ to vertex $v_d$, we say that $v_a$ is an
ancestor of $v_d$, and $v_d$ is a descendant of $v_a$. Note that because
of the transitivity of the subset relationship, a region graph must
be a directed acyclic graph, in the sense that the arrows cannot
loop around.

A region graph is closely related to the Hasse diagram for a
partially ordered set, or poset [38], if we consider our regions
to be organized into a poset, with the ordering relationship
between the regions given by the ancestor-descendant
relationship [28], [29]. There are, however, some differences
between region graphs and Hasse diagrams. First, region graphs
are labeled graphs, and we will insist on some "region graph
conditions," described below, that the labels must satisfy.
Second, region graphs can include an arc between two regions that
are also connected by a path of length two or greater, which is
forbidden for Hasse diagrams.

We define a counting number $c_v$ for every vertex in the region
graph, by

$$c_v = 1 - \sum_{u \in \mathcal{A}(v)} c_u \qquad (38)$$

where $\mathcal{A}(v)$ is the set of vertices that are ancestors of $v$. Thus,
the counting numbers for the regions of a region graph
correspond to the Möbius function of the corresponding partially
ordered set [38].

For a graph $G$ to qualify as a region graph, we further insist
on the region graph condition, which requires that for every
$i \in I$, the subgraph $G(i) = (V(i), E(i), L(i))$ formed by just
those vertices whose labels include $i$ is a connected graph that
satisfies the condition

$$\sum_{v \in V(i)} c_v = 1 \qquad (39)$$

Having defined region graphs, it is almost trivial to define
a corresponding method for generating valid region-based free
energy approximations. We simply create a region graph such
that the vertices correspond to regions, with labels corresponding
to the factor and variable nodes in a region, and we require
that every factor and variable node be contained in at
least one region. We associate the counting numbers $c_R$ for
regions directly with the counting numbers $c_v$ for the region
graph, and the region graph free energy $F_{\mathcal{R}}$ will be given by
$F_{\mathcal{R}} = \sum_R c_R F_R$, where $F_R$ is the free energy of the region
$R$.



Fig. 4. An example of a region graph. We have listed the counting number $c_R$
next to each region.

In figure 4, we give an example of a region graph for the
factor graph that we already introduced in figure 3. This region
graph was constructed to demonstrate what is and is not
permitted in a legal region graph, rather than what would likely
give good results. Note that a region graph need not obey some
properties that one might consider important (including some
which are enforced in the junction graph method described
in appendix A and the cluster variation method described in
appendix B). For example there need not be any clear
delineation of "generations" (region $\{8\}$ is a child of both region
$\{C,E,4,5,7,8\}$ and region $\{F,5,6,8,9\}$, while region $\{5\}$
is a grand-child of region $\{C,E,4,5,7,8\}$ and a child of
region $\{F,5,6,8,9\}$). Note also that regions may have counting
number equal to zero (e.g. region $\{5,6\}$), and that the fact that
a region is a sub-set of another region need not imply that it is
also a descendant of that region (e.g. regions $\{F,5,6,8,9\}$ and
$\{5,6\}$).

What is essential is that the region graph conditions that we 
described above are obeyed. We insist on these conditions for 
the following reasons. First, to reiterate the comments we made
about valid region-based free energy approximations, the con- 
dition that every factor node in the factor graph is counted once 
when we do the weighted sum over all regions ensures that the 
region graph average energy is exact if the region beliefs are ex- 
act; and the condition that every variable node is counted once 
ensures that the region graph entropy is a reasonable approxi- 
mation. The condition that the regions containing a particular 
variable node form a connected sub-graph will ensure that the 
marginal probability at any node is consistent irrespective of 
which region's beliefs one uses to compute it. Empirically, we 
have found in limited experiments that if one attempts to run 
a GBP algorithm (as described below) on graphs that do not 
satisfy all the region graph conditions, the results will be poor. 




Fig. 5. An example of a graph of regions that is not a region graph because the
sum of the counting numbers of regions containing variable node 5 is not one.

An example of a "false region graph" or graph of regions that
does not satisfy the region graph conditions is shown in figure 5.
The problem with this plausible-looking construction is that the
sum of the counting numbers of the regions containing variable
node 5 is zero, rather than one. We could modify this false
region graph in a variety of ways to obtain a real region graph.
For example, we could simply remove node 5 from the region
$\{2,5\}$. The resulting region graph would be an example of a
junction graph; see appendix A. Alternatively, we could add a
region $\{5\}$ which just contained variable node 5, and connect
the regions $\{2,5\}$, $\{C,4,5\}$, $\{D,5,6\}$, and $\{5,8\}$ to it (the
result of using the cluster variation method; see appendix B).
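The failure of figure 5 and its repair can be reproduced with a few lines of code. This is a sketch under stated assumptions: the region labels are those recoverable from figure 5, and the arcs are assumed to connect each small region to the two large regions that contain it. The function names are illustrative.

```python
# Regions as frozensets of factor (str) and variable (int) labels.
AC = frozenset({"A", "C", 1, 2, 4, 5})
BD = frozenset({"B", "D", 2, 3, 5, 6})
CE = frozenset({"C", "E", 4, 5, 7, 8})
DF = frozenset({"D", "F", 5, 6, 8, 9})
r25 = frozenset({2, 5})
rC45 = frozenset({"C", 4, 5})
rD56 = frozenset({"D", 5, 6})
r58 = frozenset({5, 8})

# Assumed arcs: each region maps to the list of its parents.
parents = {
    AC: [], BD: [], CE: [], DF: [],
    r25: [AC, BD], rC45: [AC, CE], rD56: [BD, DF], r58: [CE, DF],
}


def ancestors(region, parents):
    """All regions reachable by following parent arcs upward."""
    seen, stack = set(), list(parents[region])
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents[p])
    return seen


def counting_numbers(parents):
    """Counting numbers c_v = 1 - (sum of c_u over ancestors u), as in equation (38)."""
    c = {}

    def visit(region):
        if region not in c:
            c[region] = 1 - sum(visit(a) for a in ancestors(region, parents))
        return c[region]

    for region in parents:
        visit(region)
    return c


def node_total(node, parents):
    """Sum of counting numbers over the regions that contain the given node."""
    c = counting_numbers(parents)
    return sum(cv for region, cv in c.items() if node in region)


print(node_total(5, parents))  # four regions with c = 1 and four with c = -1

# Cluster-variation-style repair: add region {5} below the four small regions.
repaired = dict(parents)
repaired[frozenset({5})] = [r25, rC45, rD56, r58]
print(node_total(5, repaired))
```

In the unrepaired graph the total for variable node 5 is zero; after adding the region $\{5\}$, which receives counting number $1 - (4 - 4) = 1$, the total becomes one while the totals for all other nodes are unchanged.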

Just as the Bethe approximation will be exact when the
factor graph is a tree, a region graph approximation will always be
exact when the corresponding region graph is a tree. This can
be demonstrated by recursively applying the following junction
graph formula for the probability distribution of a factor graph
divided into large regions $\mathcal{R}_L$ and small regions $\mathcal{R}_S$ which
separate the large regions (see Appendix A for more details):

$$p(x) = \frac{\prod_{R \in \mathcal{R}_L} p_R(x_R)}{\prod_{R \in \mathcal{R}_S} p_R(x_R)^{d_R - 1}} \qquad (40)$$

We illustrate the idea with an example, that has the factor 
graph given in figure 6, and the region graph given in figure 
7. We will recursively break down the full joint probability 
distribution and show that it is equal to a product of marginal 
probability distributions over regions that has precisely the form 
necessary so that the region graph free energy is exact. 



Fig. 6. A factor graph that has a tree region graph shown in figure 7. 




Fig. 7. A region graph with no cycles that has a corresponding region graph 
free energy approximation which is exact. 



Note that for this region graph, the region $\{4\}$ separates the
left part of the tree and the right part of the tree. That means
that we have

$$p(x_1, \ldots, x_7) = \frac{p(x_1, x_3, x_4, x_6)\, p(x_2, x_4, x_5, x_7)}{p(x_4)} \qquad (41)$$

The marginal probability distributions $p(x_1, x_3, x_4, x_6)$ and
$p(x_2, x_4, x_5, x_7)$ can in turn be written in terms of marginal
probabilities of smaller regions. For example, we see that the
region $\{3,4\}$ separates the regions $\{A,1,3,4\}$ and $\{C,3,4,6\}$,
so that

$$p(x_1, x_3, x_4, x_6) = \frac{p(x_1, x_3, x_4)\, p(x_3, x_4, x_6)}{p(x_3, x_4)} \qquad (42)$$

Expanding everything out, we obtain that the joint probability
distribution $p(x_1, \ldots, x_7)$ equals

$$\frac{p(x_1, x_3, x_4)\, p(x_3, x_4, x_6)\, p(x_2, x_4, x_5)\, p(x_4, x_5, x_7)}{p(x_3, x_4)\, p(x_4, x_5)\, p(x_4)} \qquad (43)$$


Substituting this result into the formula for the exact entropy,
we recover the region graph entropy. Since the region graph
average energy is always exact when the region beliefs are, this
demonstrates that the approximation is exact in this case.

We note that each term in the numerator of the expression
(43) has a power of $1$, and each term in the denominator has
a power of $-1$, corresponding exactly to the counting number
of the corresponding region. In the general case of a region
graph with no cycles, the recursive application of the junction
tree separator formula (40) will always similarly reproduce the
counting numbers given by the region graph prescription,
equation (38).
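The factorization (43) can be verified numerically. The sketch below assumes four random positive factors on binary variables whose scopes match the regions quoted in the text; the names `f_B` and `f_D` and the exact scopes are assumptions for illustration, since figure 6 itself is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical positive factors on binary variables (scopes are assumptions).
f_A = rng.uniform(0.5, 2.0, size=(2, 2, 2))  # f_A(x1, x3, x4)
f_C = rng.uniform(0.5, 2.0, size=(2, 2, 2))  # f_C(x3, x4, x6)
f_B = rng.uniform(0.5, 2.0, size=(2, 2, 2))  # f_B(x2, x4, x5)
f_D = rng.uniform(0.5, 2.0, size=(2, 2, 2))  # f_D(x4, x5, x7)

# Joint p(x1, ..., x7); einsum letters a..g stand for x1..x7.
joint = np.einsum("acd,cdf,bde,deg->abcdefg", f_A, f_C, f_B, f_D)
p = joint / joint.sum()


def marginal(keep):
    """Marginal over the variables in `keep` (1-based), kept broadcastable against p."""
    drop = tuple(ax for ax in range(7) if ax + 1 not in keep)
    return p.sum(axis=drop, keepdims=True)


# Numerator and denominator of expression (43).
numerator = (marginal({1, 3, 4}) * marginal({3, 4, 6})
             * marginal({2, 4, 5}) * marginal({4, 5, 7}))
denominator = marginal({3, 4}) * marginal({4, 5}) * marginal({4})
reconstructed = numerator / denominator

print(np.max(np.abs(reconstructed - p)))  # zero up to rounding
```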

We have already seen that the stationary points of the Bethe 
approximation to the free energy are equivalent to the fixed 
points of the standard BP algorithm, which operates on a factor 
graph. In the following sections, we shall introduce generalized 
belief propagation algorithms which operate on region graphs, 
and demonstrate that their fixed points correspond to the sta- 
tionary points of the region graph free energy. 

In appendices A and B, we discuss two other methods (the
junction graph method and the cluster variation method) that
generate valid free energy approximations. Both of these methods
can be considered special cases of the region graph method. In
appendix C, we describe the relationship between all the
different methods described in this paper in more detail.



VII. Generalized Belief Propagation Algorithms

Just as the standard BP algorithm corresponds to the Bethe
approximation, one can construct generalized belief propagation
(GBP) algorithms corresponding to any region graph free
energy approximation. In fact, there are many ways to construct
message-passing algorithms whose fixed points are equivalent
to the stationary points of a region graph free energy. In all these
algorithms, messages of some sort are sent between regions on
a region graph.

One should first note that one can obtain different GBP
algorithms corresponding to the same free energy by using different
region graphs that have the same free energy. For example, one
could modify a region graph by connecting a grandparent
region directly to a grandchild region. The GBP algorithms that
we describe below would be modified, but the approximate free
energy would not be changed. Making such a modification will
alter the dynamics of a GBP algorithm, but not its fixed points.

Even if one fixes one's attention on a particular region graph,
there are still a variety of different GBP algorithms that one
can create. In the main text of this paper, we will describe one
possible approach, which we call the parent-to-child algorithm.
In appendices D and E, we describe two other approaches (the
child-to-parent algorithm and the two-way algorithm) which
give algorithms with equivalent fixed points, and which have
their own advantages. An important advantage of the
parent-to-child algorithm is that the message-passing algorithm makes
no reference to region counting numbers, just as in the standard
BP algorithm.

The standard BP algorithm is a special case of all three
algorithms when the region graph is obtained using the Bethe
method.

A. The Parent-to-Child Algorithm

As we saw, the standard BP message-passing equations can
be derived using the fact that the belief at a single variable node
is just the product of all the messages bearing information from
neighboring factor nodes, while the belief at the region of
variable nodes adjoining a single factor node is the product of their
internal factors, multiplied by all the messages coming into the
group of nodes from factor nodes outside the region.

The parent-to-child algorithm generalizes this idea. In this
algorithm (which in previous expositions we called the
"canonical" GBP algorithm [17]) the belief at any region $R$ will be the
product of the local factors in that region, multiplied by all the
messages coming into region $R$ from outside regions. There is
one complication, however: to make the algorithm equivalent
to minimizing the region graph free energy, we need to include
additional messages into regions which are descendants of $R$
from other parent regions that are not themselves descendants
of region $R$.

To be more specific, in the parent-to-child algorithm, we only
have one kind of message $m_{P \to R}(x_R)$ from a parent region to
a child region. Each region $R$ has a belief $b_R(x_R)$ given by

$$b_R(x_R) \propto \prod_{a \in A_R} f_a(x_a) \left( \prod_{P \in \mathcal{P}(R)} m_{P \to R}(x_R) \right) \left( \prod_{D \in \mathcal{D}(R)} \; \prod_{P' \in \mathcal{P}(D) \setminus \mathcal{E}(R)} m_{P' \to D}(x_D) \right) \qquad (44)$$

Here $\mathcal{P}(R)$ is the set of regions that are parents to region $R$,
$\mathcal{D}(R)$ is the set of all regions that are descendants of region $R$,
$\mathcal{E}(R) = R \cup \mathcal{D}(R)$ is the set of all regions that are descendants
of $R$ and also region $R$ itself, and $\mathcal{P}(D) \setminus \mathcal{E}(R)$ is the set of all
regions that are parents of region $D$ except those that are also
descendants of region $R$ or region $R$ itself.




Fig. 8. A region graph used to illustrate the parent-to-child GBP algorithm. 
Note that we do not explicitly give the variable and factor node labels for each 
region, as for our purposes, we are only interested in the topology of the region 
graph. 

An example may help make the belief equation clearer.
Consider the example shown in figure 8. The belief $b_R(x_R)$ at
region $R$ is the product of its local factors $\prod_{a \in A_R} f_a(x_a)$, the
messages from its parents $m_{A \to R}(x_R)$ and $m_{B \to R}(x_R)$, and
the messages into descendants from other parents who are not
descendants: $m_{C \to E}(x_E)$, $m_{C \to H}(x_H)$, and $m_{F \to H}(x_H)$.

One obtains self-consistent equations for the messages by
requiring consistency between the beliefs of every pair of
parent and child regions. Thus in figure 8, we might focus on
the region $R$ and its child $E$. The belief at region $R$ is given by

$$b_R \propto m_{A \to R}\, m_{B \to R}\, m_{C \to E}\, m_{C \to H}\, m_{F \to H} \prod_{a \in A_R} f_a(x_a) \qquad (45)$$

(where we have lightened the notation by removing the obvious
functional dependencies of the messages) and the belief at
region $E$ is given by

$$b_E \propto m_{R \to E}\, m_{C \to E}\, m_{D \to G}\, m_{C \to H}\, m_{F \to H} \prod_{a \in A_E} f_a(x_a) \qquad (46)$$

Using the marginalization constraint, $b_E(x_E) = \sum_{x_R \setminus x_E} b_R(x_R)$,
we obtain a relation between messages
that we can interpret as the message update rule

$$m_{R \to E}(x_E)\, m_{D \to G}(x_G) := \sum_{x_R \setminus x_E} m_{A \to R}(x_R)\, m_{B \to R}(x_R) \prod_{a \in A_R \setminus A_E} f_a(x_a) \qquad (47)$$

Of course, similar message update rules would be obtained 
for all the pairs of parent and children regions. There will be 
enough conditions to determine every message. 

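In general, the parent-to-child message-update rules will be

$$m_{P \to R}(x_R) := \frac{\sum_{x_P \setminus x_R} \prod_{a \in F_{P \setminus R}} f_a(x_a) \prod_{(I,J) \in N(P,R)} m_{I \to J}(x_J)}{\prod_{(I,J) \in D(P,R)} m_{I \to J}(x_J)} \qquad (48)$$

where $F_{P \setminus R}$ denotes the factor nodes that are in region $P$ but not
in region $R$, and the sets $N(P,R)$ and $D(P,R)$ can be calculated in
advance. Recall that $\mathcal{E}(R) = R \cup \mathcal{D}(R)$. Then $N(P,R)$ is the set
of all connected pairs of regions $(I,J)$ such that $J$ is in $\mathcal{E}(P)$
but not $\mathcal{E}(R)$ while $I$ is not in $\mathcal{E}(P)$. $D(P,R)$ is the set of all
connected pairs of regions $(I,J)$, other than the pair $(P,R)$ itself
whose message is being updated, such that $J$ is in $\mathcal{E}(R)$, while
$I$ is in $\mathcal{E}(P)$, but not $\mathcal{E}(R)$.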
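The sets $N(P,R)$ and $D(P,R)$ are easy to precompute from the arcs of a region graph. The sketch below uses the seven-region graph of the detailed example in section VIII, with each region named by its variable nodes. Two points are assumptions of this sketch rather than statements from the text: the function names, and the explicit exclusion of the pair $(P,R)$ itself from $D(P,R)$, which is needed so that the message being updated does not appear on the right-hand side.

```python
# Region graph of the section VIII example; each region maps to its children.
children = {
    "1235": ["12", "13"],
    "1246": ["12", "14"],
    "1347": ["13", "14"],
    "12": ["1"],
    "13": ["1"],
    "14": ["1"],
    "1": [],
}
arcs = [(p, c) for p, cs in children.items() for c in cs]


def closure(region):
    """E(R): the region itself together with all of its descendants."""
    out, stack = {region}, list(children[region])
    while stack:
        r = stack.pop()
        if r not in out:
            out.add(r)
            stack.extend(children[r])
    return out


def n_set(P, R):
    """N(P,R): arcs (I, J) with J in E(P) but not E(R), and I outside E(P)."""
    EP, ER = closure(P), closure(R)
    return {(I, J) for I, J in arcs if J in EP - ER and I not in EP}


def d_set(P, R):
    """D(P,R): arcs (I, J) with J in E(R) and I in E(P) but not E(R).

    The arc (P, R) itself is excluded (an assumption of this sketch).
    """
    EP, ER = closure(P), closure(R)
    return {(I, J) for I, J in arcs
            if J in ER and I in EP - ER and (I, J) != (P, R)}


print(n_set("1235", "12"), d_set("1235", "12"))
print(n_set("12", "1"), d_set("12", "1"))
```

For the arc from region 1235 to region 12, the numerator message comes from the arc (1347, 13) and the denominator message from the arc (13, 1); for the arc from region 12 to region 1, the numerator messages come from the arcs (1235, 12) and (1246, 12) and the denominator is empty.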

We now prove a central theorem about the parent-to-child 
GBP algorithm, which is defined by the message update rules 
(48) combined with the belief equations (44). 

Theorem: A set of messages and beliefs define a fixed point 
of the parent-to-child GBP algorithm if and only if the beliefs 
are a stationary point of the region graph free energy, where the 
region graph free energy is constrained to have beliefs that are 
consistent and normalized. 

Proof: To simplify the proof, we will assume that no region
$R$ in the region graph has counting number $c_R = 0$. In appendix
F, we discuss this technically useful assumption in detail. In
particular, we show that it is easy to remove any $c_R = 0$ regions
to get an equivalent region graph; and also that even if we do
permit them, the parent-to-child GBP algorithm will still work
properly, although the following proof no longer holds.

Recall that the region graph free energy is simply

$$F_{\mathcal{R}}(\{b_R\}) = \sum_{R \in \mathcal{R}} c_R F_R(b_R). \qquad (49)$$

To derive the stationarity conditions, we need to create a
Lagrangian $L$ for the free energy which enforces consistency
between the beliefs in every pair of connected regions. To that
end, we add Lagrange multipliers $\lambda_{PC}(x_C)$ which enforce that

$$b_C(x_C) = \sum_{x_P \setminus x_C} b_P(x_P) \qquad (50)$$

for every pair of parent and child regions $P$ and $C$.

Of course, we also need to include Lagrange multipliers
which enforce the normalization of the beliefs: $\sum_{x_R} b_R(x_R) = 1$;
they give rise to the constants $\gamma_R$ below. Setting the derivatives
of $L$ with respect to the beliefs $b_R(x_R)$ equal to zero gives us the
following stationarity conditions:

$$c_R \ln b_R(x_R) = \gamma_R + c_R \sum_{a \in A_R} \ln f_a(x_a) - \sum_{P \in \mathcal{P}(R)} \lambda_{PR}(x_R) + \sum_{C \in \mathcal{C}(R)} \lambda_{RC}(x_C), \qquad (51)$$

where $\mathcal{P}(R)$ is the set of regions that are parents of region $R$,
and $\mathcal{C}(R)$ is the set of regions that are children of region $R$. In
this expression, $x_a$ and $x_C$ are entirely determined by the value
of $x_R$.

Our proof will now work backwards from the belief equations
that we want to derive. We want to show that there exists
a "rotation" from our Lagrange multipliers $\lambda$ to another set of
Lagrange multipliers $\mu$ such that the stationary point conditions
can be re-written as

$$c_R \ln b_R(x_R) = \gamma_R + c_R \sum_{a \in A_R} \ln f_a(x_a) + c_R \sum_{P \in \mathcal{P}(R)} \mu_{PR}(x_R) + c_R \sum_{D \in \mathcal{D}(R)} \; \sum_{P' \in \mathcal{P}(D) \setminus \mathcal{E}(R)} \mu_{P'D}(x_D). \qquad (52)$$

Clearly, if we can show this, then by identifying the message
$m_{P \to R}(x_R) = \exp(\mu_{PR}(x_R))$, we will recover our desired
belief equations.

So what do the Lagrange multipliers $\mu_{PR}(x_R)$ constrain?
The answer is that they impose the constraint

$$c_R b_R(x_R) + \sum_{A \in \mathcal{A}(R) \setminus (P \cup \mathcal{A}(P))} c_A \sum_{x_A \setminus x_R} b_A(x_A) = 0. \qquad (53)$$

In words, the Lagrange multiplier $\mu_{PR}$ constrains the weighted
belief in region $R$ plus the sum of the weighted beliefs in all the
ancestor regions of region $R$, except for region $P$ and all its
ancestors, to be equal to zero. If we make a Lagrangian using
these Lagrange multipliers, it is straightforward to work out that
its stationary points are given by equation (52).

Now we need to show that the new set of $\mu$ Lagrange
multipliers and their associated constraints are equivalent to the old
set of $\lambda$ Lagrange multipliers and their constraints. We first note
that because $c_R + \sum_{A \in \mathcal{A}(R)} c_A = 1$, and $c_P + \sum_{A \in \mathcal{A}(P)} c_A = 1$,
we can subtract these two equations and obtain

$$c_R + \sum_{A \in \mathcal{A}(R) \setminus (P \cup \mathcal{A}(P))} c_A = 0 \qquad (54)$$

If we start with the $\lambda_{PC}(x_C)$ constraints that
$b_C(x_C) = \sum_{x_P \setminus x_C} b_P(x_P)$ for every pair of parent and child
regions, we can use equation (54) as a basis for deriving the
constraints associated with the $\mu$ Lagrange multipliers. It is also
always possible to go in the other direction: the $\mu$ constraints
will be linearly independent, so that if we begin with them, we can
derive the $\lambda$ constraints [28]. (This is where the proof breaks down
if there are regions with counting number $c_R = 0$; the $\mu$
constraints may be linearly dependent in that case.) □

Note that we have not given a general formula relating the
new $\mu$ Lagrange multipliers to the $\lambda$ Lagrange multipliers, as
we only need to show the existence of a rotation to a new set
of Lagrange multipliers, without constructing it explicitly. It is
difficult to derive a general formula relating the two sets of
Lagrange multipliers, but for region graphs with only two
"generations" of regions like those constructed using the junction graph
method (see appendix A), we can in fact give the relationship
explicitly:

$$\lambda_{PR}(x_R) = \sum_{P' \in \mathcal{P}(R) \setminus P} \mu_{P'R}(x_R). \qquad (55)$$

VIII. Detailed Example of a GBP Algorithm 




Fig. 9. A factor graph that we will use for our detailed example of how to 
construct a GBP algorithm. 

We will now give a detailed example of how to construct
a GBP algorithm. Consider the factor graph drawn in figure
9, which has seven variable nodes and ten factor nodes. For
this factor graph, it is convenient to slightly alter our labeling
conventions so that some of the factor nodes (the ones attached
to a single variable node) are labeled with a number rather than
a letter. This factor graph corresponds to the joint probability
distribution

$$p(x_1, x_2, \ldots, x_7) = \frac{1}{Z} \prod_{i=1}^{7} f_i(x_i)\; f_A(x_1, x_2, x_3, x_5)\, f_B(x_1, x_2, x_4, x_6)\, f_C(x_1, x_3, x_4, x_7) \qquad (56)$$

We will work out a GBP algorithm making no assumptions
about the actual forms of the functions, but we note that this
particular factor graph can be used to represent the probability
distribution that occurs when decoding a block error-correcting
code [20]. In particular, if each of the variable nodes is binary,
with possible states 0 or 1, and the functions $f_A$, $f_B$, and $f_C$ are
parity-check functions (equal to 1 if the sum of their arguments
is even, and 0 otherwise), then this factor graph corresponds
to the linear block (7,4,3) Hamming code with parity check
matrix

$$H = \begin{pmatrix} 1 & 1 & 1 & 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 & 1 \end{pmatrix}. \qquad (57)$$

For the decoding problem, the functions $f_i(x_i)$ represent the
likelihoods of the possible states of the bits, in light of the
received block from the channel and the assumed channel model.
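As a quick sanity check of the parity-check matrix (57), one can enumerate all length-7 binary words and keep those on which the three parity-check functions all equal 1. The following sketch does this by brute force; the function names are illustrative.

```python
import itertools

import numpy as np

# Parity-check matrix of equation (57); row k lists the arguments of the
# k-th parity-check function (f_A, f_B, f_C).
H = np.array([
    [1, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
])


def parity_factor(bits):
    """Equal to 1 if the sum of the arguments is even, and 0 otherwise."""
    return 1 if sum(bits) % 2 == 0 else 0


def is_codeword(x):
    """True when every parity-check function evaluates to 1 on x."""
    x = np.asarray(x)
    return all(parity_factor(x[row == 1]) for row in H)


codewords = [x for x in itertools.product((0, 1), repeat=7) if is_codeword(x)]

# For a linear code, the minimum distance equals the minimum nonzero weight.
min_weight = min(sum(x) for x in codewords if any(x))
print(len(codewords), min_weight)
```

The enumeration finds $2^4 = 16$ codewords with minimum nonzero weight 3, consistent with the (7,4,3) designation.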




Fig. 10. A region graph obtained for the factor graph of figure 9 using the 
cluster variation method. 

To obtain a GBP algorithm, we first need to create a region
graph. We use the cluster variation method, with largest regions
$\{f_A, f_1, f_2, f_3, f_5, 1, 2, 3, 5\}$, $\{f_B, f_1, f_2, f_4, f_6, 1, 2, 4, 6\}$ and
$\{f_C, f_1, f_3, f_4, f_7, 1, 3, 4, 7\}$. Following the cluster variation
method prescription for finding intersection regions detailed in
appendix B, we obtain the region graph shown in figure 10.

Now that we have a region graph, we need to choose what 
kind of GBP algorithm we want to use and then write down 
the belief and message equations for the GBP algorithm. We 
choose to use the parent-to-child algorithm. 

Note that although the region graph free energy is useful for 
theoretically justifying a GBP algorithm, it will not be necessary 
for constructing the algorithm. Instead, we can work directly 
with the belief equations. 

Recall that in the parent-to-child algorithm, we only have one
kind of message $m_{P \to R}(x_R)$ from a parent region to a child
region. Each region $R$ has a belief $b_R(x_R)$ given by equation
(44), which we rewrite here:

$$b_R(x_R) \propto \prod_{a \in A_R} f_a(x_a) \left( \prod_{P \in \mathcal{P}(R)} m_{P \to R}(x_R) \right) \left( \prod_{D \in \mathcal{D}(R)} \; \prod_{P' \in \mathcal{P}(D) \setminus \mathcal{E}(R)} m_{P' \to D}(x_D) \right) \qquad (58)$$

In words, this equation says that the belief at each region is a 
product of the local factors in that region, the messages from 
parents, and the messages into descendant regions from other 
parents who are not also descendants. 

In our region graph, we have seven regions that can be
grouped into three types of regions: the three regions
exemplified by $\{f_A, f_1, f_2, f_3, f_5, 1, 2, 3, 5\}$ that contain five factor
nodes and four variable nodes; the three regions exemplified
by $\{f_1, f_2, 1, 2\}$ that contain two factor nodes and two variable
nodes; and the single region $\{f_1, 1\}$ that contains one factor
node and one variable node.

We will use an abbreviated notation, dropping explicit $x_R$
dependence, for beliefs and messages and factor functions. The
notation is best explained with some examples: we write $b_{1235}$,
$b_{12}$ and $b_1$ for the beliefs at the regions listed in the previous
paragraph; we write $m_{35 \to 12}$ for the message from region
$\{f_A, f_1, f_2, f_3, f_5, 1, 2, 3, 5\}$ to region $\{f_1, f_2, 1, 2\}$, $m_{2 \to 1}$ for
the message from region $\{f_1, f_2, 1, 2\}$ to region $\{f_1, 1\}$, and we
abbreviate $f_A(x_1, x_2, x_3, x_5)$ as $f_A$.

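In this abbreviated notation, the belief equations for the
largest regions will be

$$b_{1235} \propto f_A f_1 f_2 f_3 f_5\, m_{46 \to 12}\, m_{47 \to 13}\, m_{4 \to 1}, \qquad (59)$$

$$b_{1246} \propto f_B f_1 f_2 f_4 f_6\, m_{35 \to 12}\, m_{37 \to 14}\, m_{3 \to 1}, \qquad (60)$$

and

$$b_{1347} \propto f_C f_1 f_3 f_4 f_7\, m_{25 \to 13}\, m_{26 \to 14}\, m_{2 \to 1}. \qquad (61)$$

Note that since these regions do not have parents, all the
relevant messages are into descendant regions from other parents
who are not descendants.

The belief equations for the intermediate-sized regions will
be

$$b_{12} \propto f_1 f_2\, m_{35 \to 12}\, m_{46 \to 12}\, m_{3 \to 1}\, m_{4 \to 1}, \qquad (62)$$

$$b_{13} \propto f_1 f_3\, m_{25 \to 13}\, m_{47 \to 13}\, m_{2 \to 1}\, m_{4 \to 1}, \qquad (63)$$

and

$$b_{14} \propto f_1 f_4\, m_{26 \to 14}\, m_{37 \to 14}\, m_{2 \to 1}\, m_{3 \to 1}. \qquad (64)$$

Finally, the belief equation for the region $\{f_1, 1\}$ will be

$$b_1 \propto f_1\, m_{2 \to 1}\, m_{3 \to 1}\, m_{4 \to 1}. \qquad (65)$$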

The message-update rules are obtained by combining these
belief equations with the marginalization conditions between
parent and child regions:

$$ b_C(x_C) = \sum_{x_P \setminus x_C} b_P(x_P). \tag{66} $$



For example, requiring consistency between the beliefs at the
region $\{f_1, 1\}$ and the region $\{f_1, f_2, 1, 2\}$ tells us that

$$ b_1(x_1) = \sum_{x_2} b_{12}(x_1, x_2), \tag{67} $$

from which we obtain

$$ m_{2 \to 1} := \sum_{x_2} f_2 \, m_{35 \to 12} \, m_{46 \to 12}. \tag{68} $$
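As a minimal numerical sketch of update (68), assume binary variables, a single-variable factor $f_2(x_2)$, and messages $m_{35 \to 12}$ and $m_{46 \to 12}$ stored as arrays indexed by $(x_1, x_2)$. The array layout, the random tables, the normalization, and the helper name `update_m2_to_1` are illustrative assumptions rather than anything prescribed by the text.

```python
import numpy as np

def update_m2_to_1(f2, m35_12, m46_12):
    """Right-hand side of (68), normalized to sum to one.

    f2     : array of shape (K,), indexed by x_2
    m35_12 : array of shape (K, K), indexed by (x_1, x_2)
    m46_12 : array of shape (K, K), indexed by (x_1, x_2)
    """
    # Sum over x_2 (index j); the result is indexed by x_1 (index i).
    m = np.einsum("j,ij,ij->i", f2, m35_12, m46_12)
    return m / m.sum()

# Hypothetical strictly positive tables, for illustration only.
rng = np.random.default_rng(0)
f2 = rng.random(2) + 0.1
m35_12 = rng.random((2, 2)) + 0.1
m46_12 = rng.random((2, 2)) + 0.1
m2_1 = update_m2_to_1(f2, m35_12, m46_12)
```

Normalizing the message is a common convenience and does not change the normalized beliefs, since beliefs are only defined up to a proportionality constant.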



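The other message-update rules, obtained in the same way
(or equivalently by using equation (48)), will be

$$ m_{3 \to 1} := \sum_{x_3} f_3 \, m_{25 \to 13} \, m_{47 \to 13}, \tag{69} $$

$$ m_{4 \to 1} := \sum_{x_4} f_4 \, m_{26 \to 14} \, m_{37 \to 14}, \tag{70} $$

$$ m_{3 \to 1} \, m_{35 \to 12} := \sum_{x_3, x_5} f_A f_3 f_5 \, m_{47 \to 13}, \tag{71} $$

$$ m_{2 \to 1} \, m_{25 \to 13} := \sum_{x_2, x_5} f_A f_2 f_5 \, m_{46 \to 12}, \tag{72} $$

$$ m_{4 \to 1} \, m_{46 \to 12} := \sum_{x_4, x_6} f_B f_4 f_6 \, m_{37 \to 14}, \tag{73} $$

$$ m_{2 \to 1} \, m_{26 \to 14} := \sum_{x_2, x_6} f_B f_2 f_6 \, m_{35 \to 12}, \tag{74} $$

$$ m_{4 \to 1} \, m_{47 \to 13} := \sum_{x_4, x_7} f_C f_4 f_7 \, m_{26 \to 14}, \tag{75} $$

and

$$ m_{3 \to 1} \, m_{37 \to 14} := \sum_{x_3, x_7} f_C f_3 f_7 \, m_{25 \to 13}. \tag{76} $$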



In practice, it often helps convergence to only step the mes- 
sages part-way to their newly computed values. This simple 
heuristic can eliminate "over-shooting" problems. 

We note here one potential practical pitfall to avoid when using
inertia. Let us suppose that we have a set of old messages
$\{m^{\rm old}\}$, which we use in the update equations to calculate a set
of messages $\{m^{\rm update}\}$, and that we want to set our new messages
to be half-way between the old messages and the updated
messages: $\{m^{\rm new}\} = \frac{1}{2}\{m^{\rm old}\} + \frac{1}{2}\{m^{\rm update}\}$.
We recommend that, when using an update equation with more than one
message on the left-hand side, all those messages be $m^{\rm update}$
messages. Mixing in $m^{\rm new}$ or $m^{\rm old}$ messages on the left-hand side
empirically often results in poor convergence properties. For
example, the update equation (71) given above should explicitly be

$$ m^{\rm update}_{3 \to 1} \, m^{\rm update}_{35 \to 12} := \sum_{x_3, x_5} f_A f_3 f_5 \, m^{\rm old}_{47 \to 13}. \tag{77} $$



Fortunately, it is always possible to schedule the message updates
so that one computes the updated messages into the smallest
regions first (e.g., messages like $m^{\rm update}_{3 \to 1}$), so that they are
available when needed to compute the updated messages into
larger regions.
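The scheduling and inertia recommendations can be sketched numerically as follows. This is only an illustration under stated assumptions: binary variables, randomly generated positive tables, a dictionary layout for the messages, unnormalized messages, and a hypothetical helper `damp`; none of these choices come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 2  # binary variables (an assumption)

# Hypothetical factor tables and old messages, strictly positive.
fA = rng.random((K, K, K, K)) + 0.1   # f_A(x1, x2, x3, x5)
f3 = rng.random(K) + 0.1              # f_3(x3)
f5 = rng.random(K) + 0.1              # f_5(x5)
old = {
    "25->13": rng.random((K, K)) + 0.1,   # indexed by (x1, x3)
    "47->13": rng.random((K, K)) + 0.1,   # indexed by (x1, x3)
    "3->1": rng.random(K) + 0.1,          # indexed by x1
    "35->12": rng.random((K, K)) + 0.1,   # indexed by (x1, x2)
}

upd = {}
# Smallest region first: update (69) uses only old messages.
upd["3->1"] = np.einsum("k,ik,ik->i", f3, old["25->13"], old["47->13"])
# Right-hand side of (71), again built from old messages only.
rhs71 = np.einsum("ijkl,k,l,ik->ij", fA, f3, f5, old["47->13"])
# Solve for the larger message using the *updated* m_{3->1},
# never the old or the damped one.
upd["35->12"] = rhs71 / upd["3->1"][:, None]

def damp(m_old, m_upd, step=0.5):
    """Step part-way from the old message to the updated one."""
    return (1.0 - step) * m_old + step * m_upd

new = {k: damp(old[k], upd[k]) for k in upd}
```

With `step=0.5` this reproduces the half-way rule in the text; smaller values of `step` give more inertia.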

There are many other details that can be handled in different 
ways in iterating the message update equations. For example, 
the messages can be initialized in any way one likes; two pop- 
ular choices are random or uniform messages. The algorithm 
typically terminates after a fixed number of iterations, or after 
some convergence criterion is satisfied, but other termination 
conditions are possible. In a decoding application, one typi- 
cally checks at each iteration whether the thresholded beliefs 
correspond to a code-word, and terminates the decoding algo- 
rithm if they do, stopping otherwise when some fixed number 
of iterations has passed. 

IX. Discussion 

In this paper, we have presented a general theory, based on
region graphs, for constructing generalized belief propagation
(GBP) algorithms. Region graphs permit easy visualization of
the structure of GBP algorithms: messages are always sent between
the neighboring regions on the graph. For region graphs
that have no cycles, the GBP algorithms are exact. We have
also seen that the fixed points of the GBP algorithm always
correspond to the stationary points of an approximate region-based
free energy, so that even when the region graph has cycles, GBP
algorithms seem to do something reasonable. The standard
BP algorithm turned out to be a special case of a GBP algorithm
obtained when the region graph is constructed using the Bethe
method.

Given a factor graph and limited computational resources,
a key remaining problem is how to choose an "optimal" region
graph, i.e., one that gives the most accurate results with the least
computational effort. We limit ourselves here to suggesting two
sensible heuristics. First, it is wise to try to collect the shortest
cycles in a factor graph into regions, so that they are handled as
accurately as possible. Second, in order that the region graph
free energy be as accurate as possible, one should try to make
the region graph resemble a tree; that is, one should avoid short
cycles in the region graph.

We have previously reported promising numerical results obtained
using GBP algorithms for inference on random Markov
Random Fields [17] and for decoding error-correcting codes
[39].

Appendix A: The Junction Graph Method

A natural idea to generalize the Bethe method is to keep the
notion that $\mathcal{R}$ should be the union of a set of large regions $\mathcal{R}_L$
and a set of small regions $\mathcal{R}_S$, but to let the regions in $\mathcal{R}_L$ or
$\mathcal{R}_S$ contain more nodes. The junction graph method, which we
describe here, exploits this idea, and is based on a generalization
of the "junction graphs" that were introduced by Aji and
McEliece [27].

We define a junction graph to be a labeled bipartite graph
$\mathcal{G} = (V_L, V_S, E, L)$ in which there are large vertices
(corresponding to large regions) $v_l \in V_L$, small vertices
(corresponding to small regions) $v_s \in V_S$, and directed edges (or arcs)
$e \in E$ connecting large vertices to small vertices. The vertices
in the junction graph are labeled, and the label of vertex $v_i$ is
denoted $L(v_i)$. The labels will be subsets of a set of indices $I$
representing factor or variable nodes of a factor graph.

For the graph $\mathcal{G}$ to be considered a junction graph, we insist
upon two conditions. First, if $v_s$ is a small vertex neighboring
the $k$ large vertices $v_{l_1}, v_{l_2}, \ldots, v_{l_k}$, then we require that $L(v_s)$ is
a subset of each of $L(v_{l_1}), L(v_{l_2}), \ldots, L(v_{l_k})$, or equivalently,
that

$$ L(v_s) \subseteq L(v_{l_1}) \cap L(v_{l_2}) \cap \cdots \cap L(v_{l_k}). \tag{A-1} $$

Secondly, we require that for any index $i \in I$, the subgraph of
$\mathcal{G}$ consisting only of the vertices which contain $i$ in their labels
is a connected tree.

The "junction graphs" introduced by Aji and McEliece [27]
are a special case of those described here. In their junction
graphs, small vertices were restricted to have precisely two
neighboring large vertices, so that the small vertices can be
interpreted as labeled "edges" between the large vertices. They
further required that small region labels not include any indices
representing factor nodes.

Given a set of regions $\mathcal{R}_{JG} = \mathcal{R}_L \cup \mathcal{R}_S$ that are organized
into a junction graph, we can always obtain a valid region-based
approximation by defining a set of counting numbers $c_R$ as
follows. For all regions $R \in \mathcal{R}_L$, we let $c_R = 1$, while for all
regions $R \in \mathcal{R}_S$, we let $c_R = 1 - d_R$, where $d_R$ is the degree
(number of neighboring large regions) of region $R$. It
is through this prescription that the arcs of the junction graph
become relevant: a small region's contribution to the free energy
is subtracted out from that of a large region only if the two
regions are connected by an arc. It is straightforward to confirm
that this prescription for the counting numbers gives us a valid
region-based free energy approximation, as the junction graph
condition that the sub-graph for each variable or factor node
is a tree guarantees that each variable and factor node will be
counted once as required in equation (26).

Aji and McEliece proved a theorem that tells us that given
any set of large regions $\mathcal{R}_L$ that contain all the factor and
variable nodes in a factor graph, we can find a corresponding set of
small regions $\mathcal{R}_S$ and organize the regions in $\mathcal{R}_{JG} = \mathcal{R}_L \cup \mathcal{R}_S$
into a junction graph. Their theorem generalizes without
difficulty to our version of junction graphs.

As an example, consider the factor graph which we introduced
in the main text and re-draw in figure 11. We could take
as our set of large regions $\mathcal{R}_L$ the four regions $\{A, C, 1, 2, 4, 5\}$,
$\{B, D, 2, 3, 5, 6\}$, $\{C, E, 4, 5, 7, 8\}$, and $\{D, F, 5, 6, 8, 9\}$. An
acceptable set of corresponding small regions $\mathcal{R}_S$ would be
$\{2, 5\}$, $\{C, 4, 5\}$, $\{D, 5, 6\}$, and $\{8\}$, with a junction graph as
shown in figure 11. Because in this case each of the small
regions is connected to two large regions, they would each have
a counting number of $-1$.
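The counting-number prescription can be checked mechanically for this example. The sketch below assumes the third small region is $\{D, 5, 6\}$ (so that factor node $D$, which appears in two large regions, is counted once) and assumes the arc list shown in the code, since the figure itself is not reproduced here; the variable names are hypothetical. It verifies the subset condition (A-1) and the requirement that every node is counted exactly once, but not the tree condition.

```python
# Regions as sets of single-character node labels (letters are factor
# nodes, digits are variable nodes).
large = [set("AC1245"), set("BD2356"), set("CE4578"), set("DF5689")]
small = [set("25"), set("C45"), set("D56"), set("8")]

# Assumed arcs (large index, small index); each small region has two
# neighboring large regions.
arcs = [(0, 0), (1, 0), (0, 1), (2, 1), (1, 2), (3, 2), (2, 3), (3, 3)]

# Condition (A-1): each small label is contained in its neighbors' labels.
subset_ok = all(small[s] <= large[l] for (l, s) in arcs)

# c_R = 1 for large regions, c_R = 1 - d_R for small regions.
degree = [sum(1 for (_, s) in arcs if s == j) for j in range(len(small))]
c_small = [1 - d for d in degree]

# Every factor and variable node must be counted exactly once.
nodes = set().union(*large)
totals = {
    n: sum(1 for R in large if n in R)
       + sum(c for R, c in zip(small, c_small) if n in R)
    for n in nodes
}
valid = all(t == 1 for t in totals.values())
```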




Fig. 11. A junction graph (on the right) for the factor graph on the left. 

The set of regions given by the Bethe method can also always
be organized into a junction graph (though not necessarily the
restricted Aji-McEliece version of a junction graph); using as
an example the same factor graph, the resulting junction graph
is shown in figure 12. It is obvious from this example that there
will always be a one-to-one isomorphism between the original
factor graph and the corresponding junction graph obtained
from the Bethe method.

Fig. 12. A junction graph for the factor graph shown in figure 3 generated
using the Bethe method. Note the isomorphism between this junction graph
and the original factor graph.

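The junction graph approximation for the Gibbs free energy is

$$ F_{JG}(\{b_R\}) = U_{JG}(\{b_R\}) - H_{JG}(\{b_R\}), \tag{A-2} $$

where

$$ U_{JG}(\{b_R\}) = \sum_{R \in \mathcal{R}_L} U_R(b_R) + \sum_{R \in \mathcal{R}_S} (1 - d_R) \, U_R(b_R) \tag{A-3} $$

and

$$ H_{JG}(\{b_R\}) = \sum_{R \in \mathcal{R}_L} H_R(b_R) + \sum_{R \in \mathcal{R}_S} (1 - d_R) \, H_R(b_R). \tag{A-4} $$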

Junction graphs are a special case of region graphs, where
there are only two "generations" of regions. It follows that
minimizing the junction graph free energy $F_{JG}$ will once again
give beliefs $\{b_R\}$ that are equivalent to those obtained from a
message-passing BP algorithm. That algorithm is sometimes
known as the generalized distributive law [24]. Again it follows
as a corollary of our more general results for region graphs that
the junction graph approximation to the Gibbs free energy will
be exact, and the generalized distributive law will give exact
results, when the junction graph is a tree. In that case, we can call
the junction graph a junction tree, and the generalized distributive
law reduces to the famous junction tree algorithm.

Our junction trees are actually a slight generalization of what
is normally called a "junction tree," in that we allow separators
(i.e., the small regions) to neighbor more than just two large
regions. We can generalize the well-known result [13] for the
joint probability function in junction trees to our case and obtain

$$ p(x) = \frac{\prod_{R \in \mathcal{R}_L} p_R(x_R)}{\prod_{R \in \mathcal{R}_S} p_R(x_R)^{d_R - 1}}. \tag{A-5} $$

To obtain this result, we note that while we have described
region graphs and junction graphs as directed graphs, from the
point of view of statistical graphical models, they are equivalent
to undirected graphs. In particular, one can re-write the full
joint probability distribution $p(x)$ for a factor graph in the form

$$ p(x) = \frac{1}{Z} \prod_{(RS)} \psi_{RS}(x_R, x_S) \prod_R \phi_R(x_R), \tag{A-6} $$

where $(RS)$ denotes pairs of connected regions in a given region
graph for that factor graph. Specifically, when we set
$\phi_R(x_R) = \left( \prod_{a \in A_R} f_a(x_a) \right)^{c_R}$ and $\psi_{RS}(x_R, x_S)$ equal to 1
if $x_R$ is consistent with $x_S$ and equal to 0 otherwise, this form
of the joint probability distribution will be equivalent to the one
in the original factor graph form. Since the formula (A-5) is
true for pairwise Markov Random Fields when the set of nodes
in $\mathcal{R}_L$ are separated by the set of nodes in $\mathcal{R}_S$, and we have
shown how to convert a region graph into an equivalent pairwise
Markov Random Field, we have justified using formula
(A-5) for region graphs as well.

Appendix B: The Cluster Variation Method 

Another method for selecting a valid set of regions $\mathcal{R}$ and
counting numbers $c_R$ is the cluster variation method introduced
by Kikuchi in 1951 and further developed in the physics
literature since then [19]. The main feature distinguishing this
method from the junction graph method is that $\mathcal{R}$ may be the
union of more than just two generations of regions.

In the cluster variation method, we begin with a set of distinct
large regions $\mathcal{R}_0$ such that every factor node $a$ and every
variable node $i$ in our factor graph is included in at least one
region $R \in \mathcal{R}_0$. We also require that no region $R \in \mathcal{R}_0$ be
a subregion of any other region in $\mathcal{R}_0$. We then construct the
set of regions $\mathcal{R}_1$ by forming all possible intersections between
regions in $\mathcal{R}_0$, but discarding from $\mathcal{R}_1$ any intersection regions
that are sub-regions of other intersection regions. If possible,
we then construct in the same way the set of regions $\mathcal{R}_2$ from
the intersections between regions in $\mathcal{R}_1$. As long as there
continue to be intersection regions, we construct sets of regions
$\mathcal{R}_3, \mathcal{R}_4, \ldots, \mathcal{R}_K$ in the same way. Finally, the set of regions used
in the cluster variation method will be
$\mathcal{R} = \mathcal{R}_0 \cup \mathcal{R}_1 \cup \cdots \cup \mathcal{R}_K$.

We define the counting numbers in the cluster variation
method to be

$$ c_R = 1 - \sum_{S \in \mathcal{S}(R)} c_S, \tag{B-1} $$

where $\mathcal{S}(R)$ is the set of all regions which are super-regions of
region $R$.

Returning to our example factor graph drawn in figure 3, we
can choose the base set of regions $\mathcal{R}_0$ to consist of the four
regions $\{A, C, 1, 2, 4, 5\}$, $\{B, D, 2, 3, 5, 6\}$, $\{C, E, 4, 5, 7, 8\}$,
and $\{D, F, 5, 6, 8, 9\}$. Once the set of base regions $\mathcal{R}_0$ is chosen,
there is no further choice in the cluster variation method.
In our case, the set of intersection regions $\mathcal{R}_1$ would be the
regions $\{2, 5\}$, $\{C, 4, 5\}$, $\{D, 5, 6\}$, $\{5, 8\}$, and the set of
intersection regions $\mathcal{R}_2$ would be $\{5\}$.

Each of the regions $R \in \mathcal{R}_0$ would have a counting number
$c_R = 1$. Because each of the regions $R \in \mathcal{R}_1$ is the subregion
of two regions in $\mathcal{R}_0$, they each have a counting number of
$c_R = 1 - 2 = -1$. Finally, since every region in $\mathcal{R}_0$ and $\mathcal{R}_1$ is
a super-region of $\{5\}$, its counting number is $1 - 4 + 4 = 1$.
We can represent this set of regions and counting numbers
with the region graph shown in figure 13.
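The construction above is easy to reproduce programmatically. The following is a minimal sketch, assuming regions are represented as `frozenset`s of single-character node labels; the function name `cluster_variation` and this representation are illustrative choices, not part of the method's definition.

```python
from itertools import combinations

def cluster_variation(base):
    """Build R_0, R_1, ... by repeated intersection and return a dict
    mapping each region to its counting number from (B-1)."""
    generations = [list(base)]
    while True:
        prev = generations[-1]
        # All non-empty pairwise intersections of the previous generation.
        inter = {a & b for a, b in combinations(prev, 2) if a & b}
        # Discard intersections that are sub-regions of other intersections.
        inter = [r for r in inter if not any(r < s for s in inter)]
        if not inter:
            break
        generations.append(inter)

    counting = {}
    # Larger regions come first, so all super-regions of r are already done.
    for gen in generations:
        for r in gen:
            counting[r] = 1 - sum(c for s, c in counting.items() if r < s)
    return counting

base = [frozenset("AC1245"), frozenset("BD2356"),
        frozenset("CE4578"), frozenset("DF5689")]
counting = cluster_variation(base)
```

For the example base regions, the returned dictionary contains the four base regions, the four intersection regions, and the single region containing variable node 5, with the counting numbers quoted in the text.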



Fig. 13. A region graph generated using the cluster variation method.

Note that the Bethe approximation will be a special case of
the cluster variation method if and only if no factor node shares
more than one variable node with another factor node (or
equivalently, there are no cycles of length four in the factor graph).
The factor graph shown in figure 13 is therefore one example
of a factor graph for which the Bethe approximation cannot be
generated by the cluster variation method.
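The condition in this paragraph is simple to test for a given factor graph. The sketch below assumes the factor graph is supplied as a dictionary from factor names to the sets of variables they touch; the function name and the two small example graphs are hypothetical and are not taken from the paper's figures.

```python
from itertools import combinations

def bethe_is_cvm_special_case(factor_scopes):
    """Return True when no two factor nodes share more than one variable
    node, i.e. the factor graph has no cycles of length four."""
    return all(len(a & b) <= 1
               for a, b in combinations(factor_scopes.values(), 2))

# Two hypothetical factor graphs, for illustration only.
chain = {"A": {1, 2}, "B": {2, 3}}           # factors share only variable 2
overlap = {"A": {1, 2, 3}, "B": {2, 3, 4}}   # factors share variables 2 and 3

chain_ok = bethe_is_cvm_special_case(chain)
overlap_ok = bethe_is_cvm_special_case(overlap)
```

In the second graph the intersection of the two large regions would be the two-variable region {2, 3}, which the Bethe method (whose small regions are single variable nodes) never produces.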

We remark that in the physics literature, the cluster variation 
method has normally been applied to a restricted class of fac- 
tor graphs that are particularly relevant as models of magnetic 
materials. In particular, the factor graph normally represents a 
translationally invariant crystal lattice, and the factor nodes nor- 
mally have degree two, corresponding to two-body interactions. 
Translational symmetry in the factor graph often dramatically 
simplifies the problem of minimizing the Kikuchi free energy, 
and when the factor nodes have degree two, the Bethe method 
will always be a special case of the cluster variation method. 




Fig. 14. For this factor graph, the choice of regions {A, 1, 2, 4}, {B, 1, 3, 5},
{C, 2, 3, 6}, and {1, 2, 3}, with corresponding counting numbers of 1, 1, 1, and
-1, will give a valid region-based approximation that cannot be represented by
a region graph.



Note, however, that although the region graph method is the 
most general method we have introduced, there do exist valid 
region-based free energy approximations that do not have a re- 
gion graph representation. We demonstrate an example in fig- 
ure 14. 




Appendix C: Relationships Between Different Methods

In this appendix, we summarize the relationships between 
the different methods for generating valid sets of regions for a 
region-based free energy approximation. First of all, as is clear 
from its definition, a junction graph will always be a region 
graph (though the converse is not true). The sets of regions and 
counting numbers generated by the cluster variation method can 
also always be represented by a region graph. We already saw 
one example in figure 13. 

We emphasize that one can construct region graph approxi- 
mations that cannot be generated with either the junction graph 
or cluster variation methods. We already saw such an example 
when we introduced region graphs in the main text in section 
VI. Constructions that are more general than those constructed 
using the cluster variation method or the junction graph method 
may be useful for a variety of reasons, including reducing the 
computational complexity of a GBP algorithm. 



Fig. 15. A Venn diagram illustrating the relationships between different meth- 
ods of generating valid region-based free energy approximations. The Bethe 
method is always an exemplar of the junction graph method, but is only a spe- 
cial case of the cluster variation method if the factor graph has no pair of factor 
nodes that share more than one variable node, and is only a special case of Aji 
and McEliece's junction graph method if the relevant factor graph is a Forney 
"normal" graph (no variable node is connected to more than two factor nodes). 

In summary, we have the following relationships, as illus- 
trated in the Venn diagram of figure 15. For a given factor 
graph, the cluster variation method and the generalized junc- 
tion graph method each generate valid region-based free en- 
ergy approximations that are subclasses of all the possible valid 
approximations. Neither the cluster variation method nor the 
generalized junction graph method is more general than the 
other, and both are subsumed by the more general region graph 
method. The set of regions generated by the Bethe method is
always an exemplar of those generated by the junction graph
method, and will be an exemplar of those generated by the cluster
variation method if and only if the factor graph has no cycles
of length four. In general, the Bethe method will not be a special
case of the Aji-McEliece junction graph method, though it
will be for factor graphs such that each variable node is adjacent
to no more than two factor nodes (Forney's so-called "normal"
factor graphs [21]).

In addition to being a more general method than the clus- 
ter variation method or the junction graph method, we feel that 
the region graph method is easier to understand on an intuitive 
level. We simply select a set of regions and counting num- 
bers such that every variable and factor node gets counted once, 
and such that we can enforce consistency for the belief over 
any variable node, no matter which region we choose. Region 
graphs also have the important advantage of being a natural 
graphical structure for describing generalized belief propaga- 
tion algorithms. 

Pakzad and Anantharam have suggested strengthening the region
graph requirements described in section VI so that for every
subset of variable nodes in the factor graph, the sub-graph
of regions containing that subset must be connected and must
have a sum of counting numbers equal to one [29]. Such a
strengthening would ensure that the beliefs computed for any
subset of nodes would be consistent, no matter which regions
were used to compute them. The cluster variation method produces
region graphs that satisfy these stronger requirements, but we
chose not to insist on these stronger requirements in general,
because region graphs created using the Bethe method will not
necessarily satisfy them.

Appendix D: The Child-to-Parent Algorithm

The observation underlying the "child-to-parent algorithm"
is that when we minimize the Bethe free energy, the Lagrange
multipliers enforcing the marginalization constraints
correspond exactly (after exponentiation) to the $n_{i \to a}(x_i)$ messages
from variable nodes to factor nodes in the BP algorithm.
Considering these messages as messages from child regions to
parent regions in a region graph, we can try to generalize the
approach to arbitrary region graphs. Thus, we construct a GBP
algorithm by simply minimizing a region graph free energy and
identifying Lagrange multipliers that enforce consistency between
beliefs with messages from child regions to parent regions.
Such an approach was considered in detail by Kappen
and Wiegerinck for region graphs constructed using the cluster
variation method [37].

We begin with the stationary point equations obtained from
differentiating a Lagrangian $L$ that represents a region graph
free energy $F_{\mathcal{R}}(\{b_R\})$ with beliefs $\{b_R\}$ that are constrained
to be consistent with their neighbors on the region graph. We
obtained this equation previously (see equation (51)), and
rewrite it here:

$$ c_R \ln b_R(x_R) = \gamma_R + c_R \sum_{a \in A_R} \ln f_a(x_a) - \sum_{P \in \mathcal{P}(R)} \lambda_{PR}(x_R) + \sum_{C \in \mathcal{C}(R)} \lambda_{RC}(x_C), \tag{D-1} $$

where $\mathcal{P}(R)$ is the set of regions that are parents of region $R$,
$\mathcal{C}(R)$ is the set of regions that are children of region $R$, and
$\lambda_{PR}(x_R)$ are the Lagrange multipliers that enforce consistency
between the beliefs in region $P$ and those in region $R$.

For $c_R \neq 0$, we can re-write this equation as

$$ b_R(x_R) \propto \prod_{a \in A_R} f_a(x_a) \left( \frac{\prod_{C \in \mathcal{C}(R)} n_{C \to R}(x_C)}{\prod_{P \in \mathcal{P}(R)} n_{R \to P}(x_R)} \right)^{1/c_R}, \tag{D-2} $$

where $n_{C \to P}(x_C) = \exp(\lambda_{PC}(x_C))$ is a "message" from a
child region $C$ to a parent region $P$, in analogy with the messages
$n_{i \to a}(x_i)$ in standard BP. If $c_R = 0$, we do not get a condition
on $b_R(x_R)$ ($b_R(x_R)$ can still be determined from beliefs
in super-regions via the marginalization conditions); instead we
obtain the following condition on the messages into and out of
region $R$:

$$ \frac{\prod_{C \in \mathcal{C}(R)} n_{C \to R}(x_C)}{\prod_{P \in \mathcal{P}(R)} n_{R \to P}(x_R)} = 1. \tag{D-3} $$

The message update rules are then obtained by applying the
marginalization conditions $b_C(x_C) = \sum_{x_P \setminus x_C} b_P(x_P)$.

A small example might help clarify the meaning of these
equations for the reader. Consider the probability distribution

$$ p(x_1, x_2, x_3) = \frac{1}{Z} f_A(x_1, x_2) f_B(x_2, x_3). \tag{D-4} $$

We use the Bethe approximation, which should be exact in this
case because the factor graph is a tree. Thus, we obtain large
regions $\{A, 1, 2\}$ and $\{B, 2, 3\}$, with counting numbers 1, and
small regions $\{1\}$, $\{2\}$, and $\{3\}$, with counting numbers 0, $-1$,
and 0 respectively. We obtain the following belief equations for
the regions with $c_R \neq 0$:

$$ b_A(x_1, x_2) \propto f_A(x_1, x_2) \, n_{1 \to A}(x_1) \, n_{2 \to A}(x_2), \tag{D-5} $$

$$ b_B(x_2, x_3) \propto f_B(x_2, x_3) \, n_{2 \to B}(x_2) \, n_{3 \to B}(x_3), \tag{D-6} $$

$$ b_2(x_2) \propto n_{2 \to A}(x_2) \, n_{2 \to B}(x_2), \tag{D-7} $$

and the following conditions on messages for the regions with
$c_R = 0$:

$$ n_{1 \to A}(x_1) = 1 \tag{D-8} $$

and

$$ n_{3 \to B}(x_3) = 1. \tag{D-9} $$

Using these conditions and the marginalization conditions, we
find that

$$ n_{2 \to A}(x_2) = \sum_{x_3} f_B(x_2, x_3) \tag{D-10} $$

and

$$ n_{2 \to B}(x_2) = \sum_{x_1} f_A(x_1, x_2). \tag{D-11} $$

We can now easily check that in this example, the computed 
beliefs give back the desired marginal probabilities exactly. 
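This check can also be carried out numerically. The sketch below assumes randomly generated positive factor tables with arbitrary alphabet sizes (2, 3, and 4 states for $x_1$, $x_2$, $x_3$); these tables and all variable names are illustrative assumptions. It builds the messages of (D-8) to (D-11), forms the beliefs of (D-5) to (D-7), and compares them with the exact marginals of (D-4).

```python
import numpy as np

rng = np.random.default_rng(1)
fA = rng.random((2, 3)) + 0.1   # f_A(x1, x2), hypothetical table
fB = rng.random((3, 4)) + 0.1   # f_B(x2, x3), hypothetical table

# Messages from (D-8) to (D-11).
n1A = np.ones(2)                # n_{1->A}(x1) = 1
n3B = np.ones(4)                # n_{3->B}(x3) = 1
n2A = fB.sum(axis=1)            # n_{2->A}(x2) = sum over x3 of f_B
n2B = fA.sum(axis=0)            # n_{2->B}(x2) = sum over x1 of f_A

# Beliefs from (D-5) to (D-7), normalized.
bA = fA * n1A[:, None] * n2A[None, :]
bB = fB * n2B[:, None] * n3B[None, :]
b2 = n2A * n2B
bA, bB, b2 = bA / bA.sum(), bB / bB.sum(), b2 / b2.sum()

# Exact joint distribution (D-4) and its marginals.
p = fA[:, :, None] * fB[None, :, :]
p /= p.sum()
exact_A = p.sum(axis=2)         # p(x1, x2)
exact_B = p.sum(axis=0)         # p(x2, x3)
exact_2 = p.sum(axis=(0, 2))    # p(x2)
```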

The child-to-parent algorithm, by its construction, clearly
gives a generalized BP algorithm whose fixed points correspond
to the stationary points of the region graph free energy. On the
other hand, it might be considered inelegant both because it
focuses only on the messages from child regions to parent regions
and because the message update equations will inevitably be
complicated and involve the counting numbers $c_R$. The two-way
algorithm described in Appendix E and the parent-to-child
algorithm described in the main text in section VII-A are different
GBP algorithms that attempt to ameliorate these flaws.






Appendix E: The Two-Way Algorithm

To motivate the two-way algorithm, we return to the standard
BP algorithm, where we recall that the belief equations can be
written in the form

$$ b_i(x_i) = \prod_{a \in N(i)} m_{a \to i}(x_i) \tag{E-1} $$

and

$$ b_a(x_a) = f_a(x_a) \prod_{i \in N(a)} n_{i \to a}(x_i), \tag{E-2} $$

where

$$ n_{i \to a}(x_i) = \prod_{b \in N(i) \setminus a} m_{b \to i}(x_i). \tag{E-3} $$

Given these equations, it is natural to aim for a generalization
where the belief equations will have the form

$$ b_R(x_R) = \tilde f_R(x_R) \prod_{C \in \mathcal{C}(R)} n_{C \to R}(x_C) \prod_{P \in \mathcal{P}(R)} m_{P \to R}(x_R). \tag{E-4} $$

In other words, we aim to write the belief equations so that
the belief in a region is a product of local factors and messages
arriving from all the connected regions, whether they are
parents or children. It will turn out that we can do this, but
in order that the GBP algorithm correspond to the region
graph free energy, we will need to use modified factors and a
rather complicated relation between the $n_{C \to P}(x_C)$ messages
and $m_{P \to C}(x_C)$ messages generalizing the relation for standard
BP given in equation (E-3).

It will be convenient to denote the number of parents of region
$R$ by $p_R$, and define the numbers $q_R = (1 - c_R)/p_R$ and
$\beta_R = 1/(2 - q_R)$. When a region has no parent so that $p_R = 0$
and $c_R = 1$, we take $q_R = \beta_R = 1$. Note that within the Bethe
approximation, $q_R = \beta_R = 1$ for all regions. We will assume
that $q_R \neq 2$ so that $\beta_R$ is well-defined (normally, if one has a
region graph with a region such that $q_R = 2$, one should be able
to change the connectivity of $R$ to avoid this problem).
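These definitions translate directly into a small helper. The function name `q_and_beta` and the error raised for $q_R = 2$ are illustrative choices; the three calls use a parentless region, a Bethe small region with three parents (counting number $1 - 3 = -2$), and the region $\{5\}$ of the cluster variation example in Appendix B, assuming it has the four regions of $\mathcal{R}_1$ as parents in the region graph.

```python
def q_and_beta(c_R, p_R):
    """Return (q_R, beta_R) for a region with counting number c_R
    and p_R parents, following the conventions in the text."""
    if p_R == 0:
        # Convention for regions without parents (where c_R = 1).
        return 1.0, 1.0
    q = (1 - c_R) / p_R
    if q == 2:
        raise ValueError("beta_R is undefined when q_R = 2")
    return q, 1.0 / (2.0 - q)

top_region = q_and_beta(1, 0)      # no parents
bethe_small = q_and_beta(-2, 3)    # Bethe: c_R = 1 - 3, three parents
cvm_center = q_and_beta(1, 4)      # region {5}: c_R = 1, four parents
```

The last case gives $q_R = 0$ and $\beta_R = 1/2$, an instance where the true messages differ from the pseudo-messages.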

We first define the set of pseudo-messages for all regions $R$
and their parents $P$ and children $C$:

$$ n^0_{R \to P}(x_R) = \tilde f_R(x_R) \prod_{P' \in \mathcal{P}(R) \setminus P} m_{P' \to R}(x_R) \prod_{C \in \mathcal{C}(R)} n_{C \to R}(x_C) \tag{E-5} $$

and

$$ m^0_{R \to C}(x_C) = \sum_{x_R \setminus x_C} \tilde f_R(x_R) \prod_{P \in \mathcal{P}(R)} m_{P \to R}(x_R) \prod_{C' \in \mathcal{C}(R) \setminus C} n_{C' \to R}(x_{C'}), \tag{E-6} $$

where $\tilde f_R(x_R) = \left( \prod_{a \in A_R} f_a(x_a) \right)^{c_R}$.

Aside from the fact that we raised the product of the local
factors to a power of $c_R$, these pseudo-messages are what one
would naively expect the message updates to look like. To obtain
the true message updates, however, one needs to combine
the pseudo-messages going in the two directions of a link as
follows:

$$ n_{R \to P}(x_R) = \left( n^0_{R \to P}(x_R) \right)^{\beta_R} \left( m^0_{P \to R}(x_R) \right)^{\beta_R - 1} \tag{E-7} $$

and

$$ m_{P \to R}(x_R) = \left( n^0_{R \to P}(x_R) \right)^{\beta_R - 1} \left( m^0_{P \to R}(x_R) \right)^{\beta_R}. \tag{E-8} $$

Note that when $\beta_R = 1$, the messages are precisely the same as
the pseudo-messages.

The two-way algorithm is completed by the belief equations,
which have the form already given in equation (E-4). We now
claim that the above sets of messages and beliefs are fixed
points of two-way GBP if and only if they are stationary points
of the region graph free energy.

Proof: We form a Lagrangian from the region graph energy
as already indicated in the previous section on the
child-to-parent algorithm. If we exponentiate equation (51) derived
there, we obtain the equation

$$ b_R(x_R)^{c_R} \propto \tilde f_R(x_R) \prod_{C \in \mathcal{C}(R)} e^{\lambda_{RC}(x_C)} \prod_{P \in \mathcal{P}(R)} e^{-\lambda_{PR}(x_R)}. \tag{E-9} $$

Suppose that we are given a set of $\lambda$ and $b_R$ that satisfy these
stationary conditions of the Lagrangian. Now we define

$$ n_{R \to P}(x_R) = e^{\lambda_{PR}(x_R)} \tag{E-10} $$

and

$$ m_{P \to R}(x_R) = b_R(x_R)^{q_R} \, e^{-\lambda_{PR}(x_R)}. \tag{E-11} $$

Of course, we have one $m$ message and one $n$ message for
every Lagrange multiplier $\lambda$, so for these definitions to hold,
we also need to have constraints relating the $m$'s and $n$'s.
The constraints will be given by the definitions of the
pseudo-messages and the relations between the messages and the
pseudo-messages that we defined above. We want to show that
these relations, as well as the two-way GBP belief equations
previously defined, must hold.

First, we show that the belief equations (E-4) hold. We have

$$ \begin{aligned}
b_R(x_R)^{c_R} &\propto \tilde f_R(x_R) \prod_{C \in \mathcal{C}(R)} e^{\lambda_{RC}(x_C)} \prod_{P \in \mathcal{P}(R)} e^{-\lambda_{PR}(x_R)} \\
&\propto \tilde f_R(x_R) \prod_{C \in \mathcal{C}(R)} n_{C \to R}(x_C) \prod_{P \in \mathcal{P}(R)} \frac{m_{P \to R}(x_R)}{b_R(x_R)^{q_R}} \\
&\propto b_R(x_R)^{-q_R p_R} \, \tilde f_R(x_R) \prod_{C \in \mathcal{C}(R)} n_{C \to R}(x_C) \prod_{P \in \mathcal{P}(R)} m_{P \to R}(x_R) \\
&\propto b_R(x_R)^{c_R - 1} \, \tilde f_R(x_R) \prod_{C \in \mathcal{C}(R)} n_{C \to R}(x_C) \prod_{P \in \mathcal{P}(R)} m_{P \to R}(x_R),
\end{aligned} $$

so that indeed $b_R(x_R)$ is a product of local potentials and incoming
messages.

Turning to the constraints, we have from the definition of
$n^0_{R \to P}(x_R)$ that

$$ \begin{aligned}
n^0_{R \to P}(x_R) \, m_{P \to R}(x_R) &= b_R(x_R) && \text{(E-12)} \\
&= \sum_{x_P \setminus x_R} b_P(x_P) && \text{(E-13)} \\
&= n_{R \to P}(x_R) \, m^0_{P \to R}(x_R). && \text{(E-14)}
\end{aligned} $$

Equations (E-10) and (E-11) imply that

$$ \begin{aligned}
n_{R \to P}(x_R) \, m_{P \to R}(x_R) &= b_R(x_R)^{q_R} && \text{(E-15)} \\
&= \left( n^0_{R \to P}(x_R) \, m_{P \to R}(x_R) \right)^{q_R}. && \text{(E-16)}
\end{aligned} $$

Together these equations give us two equations for the two
unknowns $m_{P \to R}(x_R)$ and $n_{R \to P}(x_R)$:

$$ m_{P \to R}(x_R) = \frac{m^0_{P \to R}(x_R)}{n^0_{R \to P}(x_R)} \, n_{R \to P}(x_R) \tag{E-17} $$

and

$$ n_{R \to P}(x_R) \, m_{P \to R}(x_R)^{1 - q_R} = \left( n^0_{R \to P}(x_R) \right)^{q_R}. \tag{E-18} $$

The unique solution of these equations is given by equations
(E-7) and (E-8). Thus, we have shown that the message passing
algorithm previously defined has fixed points that are equivalent
to the stationary points of the region graph free energy.
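The last step of the proof can be spot-checked numerically. The sketch below picks an arbitrary value of $q_R$ and random positive pseudo-messages (both are illustrative assumptions), forms the messages from the combination rules (E-7) and (E-8) with $\beta_R = 1/(2 - q_R)$, and confirms that they satisfy the two constraints (E-14) and (E-16) pointwise.

```python
import numpy as np

rng = np.random.default_rng(2)
q = 0.4                        # arbitrary q_R different from 2
beta = 1.0 / (2.0 - q)         # beta_R = 1 / (2 - q_R)

# Hypothetical positive pseudo-messages over a 5-state x_R.
n0 = rng.random(5) + 0.1       # n^0_{R->P}(x_R)
m0 = rng.random(5) + 0.1       # m^0_{P->R}(x_R)

# True messages from the combination rules.
n = n0**beta * m0**(beta - 1.0)     # (E-7)
m = n0**(beta - 1.0) * m0**beta     # (E-8)

# Constraint (E-14): n^0 m = n m^0.
lhs_14, rhs_14 = n0 * m, n * m0
# Constraint (E-16): n m = (n^0 m)^{q_R}.
lhs_16, rhs_16 = n * m, (n0 * m) ** q
```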

The two-way algorithm will be particularly elegant when
$\tilde f_R(x_R) = f_R(x_R)$ and when $\beta_R = 1$ for all regions. In that
case, each region will send messages to all adjacent regions,
and the message update rules will be the natural generalization
of the ordinary BP rules written with two kinds of messages. It
is interesting to note that the condition that $\tilde f_R(x_R) = f_R(x_R)$
can be ensured by requiring that only regions with no parents
contain factor nodes, while the condition that $\beta_R = 1$ for all
regions can be ensured by requiring that the sub-graph obtained
by taking any region and all of its ancestor regions must always
form a tree.

When $\beta_R = 1$ for all regions, the two-way GBP algorithm
is equivalent to Pearl's method of clustering [9]: we form new
nodes from clusters of variables in the original graph (these are
the regions) and run an ordinary BP algorithm on the resulting
graph. It is important to bear in mind that this equivalence
only holds for a subset of possible region graphs: if one uses
this method on a set of regions that does not satisfy the region
graph conditions, or on a region graph for which $\beta_R \neq 1$ for
some regions, the resulting beliefs will generally be a poor
approximation.



Fig. 16. An illustration of how one can take a region graph with some regions
that have counting number zero, and obtain another region graph with no such
regions but with an identical free energy. One first removes regions with a
counting number of zero, and then directly connects any ancestor-descendant
pairs that have become disconnected. In this example, we form new direct
connections between regions R and H and between regions B and H.



Appendix F: Region Graphs with $c_R = 0$ Regions

In our proof that the fixed points of the parent-to-child GBP
algorithm are equivalent to the stationary points of the region
graph free energy (given in section VII-A), we assumed that
no region has counting number $c_R = 0$. That is never difficult
to arrange: if one has a region graph with regions whose
counting number equals zero, one can remove them, and then
connect directly any regions that were previously ancestors or
descendants of each other, but are no longer after the removal
of the $c_R = 0$ regions. The remaining regions will have identical
counting numbers by construction, and since the regions
with $c_R = 0$ did not contribute to the region graph free energy
in any case, it will be unchanged. In figure 16, we illustrate
the "surgery" that needs to be performed on a region graph to
remove regions with counting number zero.

In fact, however, the parent-to-child algorithm is well-defined
even when some of the regions have counting numbers
equal to zero, and when one implements it, one finds that the
results at its fixed points are identical to those obtained when
one surgically removes the $c_R = 0$ regions. The reason that
the algorithm still gives proper results, even though the above
proof breaks down, is that the $\lambda$ constraints that cannot be derived
from the $\mu$ constraints are actually not necessary: they all
involve $c_R = 0$ regions that do not contribute to the free energy
in any case.

Fig. 17. A small illustrative region graph (see text). Note that region F has
counting number $c_F = 0$.

A small example may make this point more comprehensible.
Consider the small region graph shown in figure 17. The
counting numbers of the regions are $c_A = c_B = c_C = 1$,
$c_D = c_E = -1$, and $c_F = 0$, so that region $F$ could clearly be
removed to obtain an equivalent region graph. For the purpose
of illustration, we leave it in. We have six $\lambda$ constraints, each
of which is very straightforward. For example, the constraint
associated with $\lambda_{AD}(x_D)$ is $b_D(x_D) = \sum_{x_A \setminus x_D} b_A(x_A)$,
while the constraint associated with $\lambda_{DF}(x_F)$ is
$b_F(x_F) = \sum_{x_D \setminus x_F} b_D(x_D)$.
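The counting numbers quoted for this example, and the effect of removing region $F$, can be reproduced with a short script. The sketch assumes the region graph rule from the main text that a region's counting number is one minus the sum of the counting numbers of all its ancestors, and it assumes the edge structure A→D, B→D, B→E, C→E, D→F, E→F for figure 17, inferred from the constraints named in this appendix since the figure is not reproduced here. The helper names are hypothetical.

```python
def ancestors(region, parents):
    """All regions reachable from `region` by following parent links."""
    seen, stack = set(), list(parents.get(region, []))
    while stack:
        r = stack.pop()
        if r not in seen:
            seen.add(r)
            stack.extend(parents.get(r, []))
    return seen

def counting_numbers(regions, parents):
    """c_R = 1 - (sum of counting numbers of all ancestors of R)."""
    c = {}
    def get(r):
        if r not in c:
            c[r] = 1 - sum(get(a) for a in ancestors(r, parents))
        return c[r]
    for r in regions:
        get(r)
    return c

def remove_region(parents, z):
    """Delete region z, reconnecting its children directly to its parents."""
    new = {}
    for r, ps in parents.items():
        if r == z:
            continue
        kept = []
        for p in ps:
            for x in (parents.get(z, []) if p == z else [p]):
                if x not in kept:
                    kept.append(x)
        new[r] = kept
    return new

# Assumed structure of the region graph in figure 17.
parents = {"D": ["A", "B"], "E": ["B", "C"], "F": ["D", "E"]}
c = counting_numbers(list("ABCDEF"), parents)

# "Surgery": remove the c_R = 0 region F and recompute.
parents_cut = remove_region(parents, "F")
c_cut = counting_numbers(list("ABCDE"), parents_cut)
```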

The six $\mu$ constraints are somewhat less straightforward. Going
back to the prescription given in equation (53), we see for
example that the constraint associated with $\mu_{AD}(x_D)$ is

$$ c_D \, b_D(x_D) + c_B \sum_{x_B \setminus x_D} b_B(x_B) = 0 \tag{F-1} $$

or equivalently,

$$ b_D(x_D) = \sum_{x_B \setminus x_D} b_B(x_B), \tag{F-2} $$

while the constraint associated with $\mu_{DF}(x_F)$ is

$$ c_F \, b_F(x_F) + c_E \sum_{x_E \setminus x_F} b_E(x_E) + c_C \sum_{x_C \setminus x_F} b_C(x_C) = 0 \tag{F-3} $$

or equivalently

$$ \sum_{x_C \setminus x_F} b_C(x_C) = \sum_{x_E \setminus x_F} b_E(x_E). \tag{F-4} $$

Because $c_F = 0$, there will not be any $\mu$ constraint directly
involving $b_F(x_F)$, so we cannot derive some of the $\lambda$
constraints. On the other hand, these constraints are not necessary,
because the region graph free energy itself also does not depend
directly on $b_F(x_F)$. We also see that the $\mu$ constraints are still
sufficient to ensure that all the beliefs are consistent when they
are marginalized down to region $F$. Finally, if we do surgery
on this region graph and remove region $F$, we can then easily
verify that the $\lambda$ constraints are then entirely equivalent to the
$\mu$ constraints.